Abstract

We study the problem of concealing functionality of a proprietary or private module when provenance 
information is shown over repeated executions of a workflow which contains both public and private 
modules. Our approach is to use provenance views to hide carefully chosen subsets of data over all 
executions of the workflow to ensure F-privacy: for each private module and each input x, the module's
output f(x) is indistinguishable from F - 1 other possible values given the visible data in the workflow
executions. We show that F-privacy cannot be achieved simply by combining solutions for individual 
private modules; data hiding must also be propagated through public modules. We then examine how 
much additional data must be hidden and when it is safe to stop propagating data hiding. The answer 
depends strongly on the workflow topology as well as the behavior of public modules on the visible data. 
In particular, for a class of workflows (which include the common tree and chain workflows), taking 
private solutions for each private module, augmented with a public closure that is upstream-downstream 
safe, ensures F-privacy. We define these notions formally and show that the restrictions are necessary. 
We also study the related optimization problems of minimizing the amount of hidden data. 

1 Introduction 

Workflow provenance has been extensively studied, and is increasingly captured in workflow systems to
ensure reproducibility, enable debugging, and verify the validity and reliability of results. However, as pointed
out in [16], there is a tension between provenance and privacy: confidential intermediate data may be shown
(data privacy); the functionality of proprietary modules may become exposed by showing the input and
output values to that module over all executions of the workflow (module privacy); and the exact execution path
taken in a specification, hence details of the connections between data, may be revealed (structural privacy).
An increasing amount of attention is therefore being paid to specifying privacy concerns, and developing
techniques to guarantee that these concerns are addressed [30, 32, 17, 18].

This paper focuses on privacy of module functionality, in particular in the general (and common)
setting in which proprietary (private) modules are used in workflows which also contain non-proprietary
(public) modules, whose functionality is assumed to be known by users. There are proprietary modules
for tasks such as gene sequencing, protein folding, and medical diagnosis that are commercially available and are
combined with other modules in a workflow for different biological or medical experiments [2, 1]. The
functionality of these proprietary modules (i.e. what result will be output for a given input) is not known,
and owners of these proprietary modules would like to ensure that their functionality is not revealed when
the provenance information is published. In contrast, for a public module (e.g. a reformatting or sorting
module), given an input to the module a user can construct the output even if the exact algorithm used by
the module is not known by users (e.g. merge sort vs. quicksort).

Following [15], the approach we use is to extend the notion of ℓ-diversity (Tf) to the workflow setting by
carefully choosing a subset of intermediate input/output data to hide over all executions of the workflow so
that each private module is "F-private": for every input x, the actual value of the output of the module, f(x),
is indistinguishable from F - 1 other possible values w.r.t. the visible data values in the provenance
information (in Section 6 we discuss ideas related to differential privacy). The complexity of the problem arises
from the fact that modules interact with each other through data flow defined by the workflow structure, 
and therefore merely hiding subsets of inputs/outputs for private modules may not guarantee their privacy 
when embedded in a workflow. We consider workflows with directed acyclic graph (DAG) structure, that 
are commonly used in practice f31, contain common chain and tree workflows, and comprise a fundamental 
yet non-trivial class of workflows for analyzing module privacy. 

As an example, consider a private module m2, which we assume is non-constant. Clearly, when executed
in isolation as a standalone module, either hiding all its inputs or hiding all its outputs over all
executions guarantees privacy for any privacy parameter F. However, suppose m2 is embedded in a simple chain
workflow m1 → m2 → m3, where both m1 and m3 are public equality modules. Then even if we hide
both the input and output of m2, their values can be retrieved from the input to m1 and the output from m3.
Note that the same problem would arise if m1 and m3 were invertible functions, e.g. reformatting modules,
a common case in practice.
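
The leak in this chain can be reproduced in a few lines. The sketch below is illustrative only: the attribute names a1..a4, the choice of boolean negation as the secret function of m2, and all helper names are assumptions made for the example, not part of the paper's formal model.

```python
def m1(a1):
    return a1            # public equality module

def m2(a2):
    return 1 - a2        # private module; negation is an assumed stand-in

def m3(a3):
    return a3            # public equality module

def run(a1):
    """One execution of the chain m1 -> m2 -> m3 as a provenance tuple."""
    a2 = m1(a1)
    a3 = m2(a2)
    a4 = m3(a3)
    return {"a1": a1, "a2": a2, "a3": a3, "a4": a4}

def view(execution, hidden):
    """Provenance view: drop the hidden attributes."""
    return {k: v for k, v in execution.items() if k not in hidden}

def infer_m2(views):
    """Adversary who knows m1 and m3 are equality: a2 = a1 and a3 = a4."""
    return {v["a1"]: v["a4"] for v in views}

# Hide both the input (a2) and the output (a3) of the private module.
views = [view(run(a1), hidden={"a2", "a3"}) for a1 in (0, 1)]
recovered = infer_m2(views)
```

Although neither a2 nor a3 appears in any view, `recovered` is the full input/output table of m2, which is exactly the failure the example describes.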

In [15] we showed that in a workflow with only private modules (an all-private workflow) the problem
has a simple, elegant solution: if a set of hidden input/output data guarantees F-standalone-privacy for a
private module, and the module is placed in an all-private workflow where a superset of that data is
hidden, then F-workflow-privacy is guaranteed for that module in the workflow. In other words, in an
all-private workflow, hiding the union of the corresponding hidden data of the individual modules guarantees
F-workflow-privacy for all of them. Clearly, as illustrated above, this does not hold when the private module
is placed in a workflow which contains public and private modules (a public/private workflow). In [15] we
therefore explored privatizing public modules, i.e. hiding the names of carefully selected public modules so
that their function is no longer known, and then hiding subsets of input/output data to ensure their F-privacy.
Returning to the example above, if it were no longer known that m1 was an equality module then hiding
the input to m2 (output of m1) would be sufficient. Similarly, if m3 were privatized then hiding the output of
m2 (input to m3) would be sufficient. It may appear that merging some public modules with preceding or
succeeding private modules may give a workflow with all private modules, to which the methods from [15]
can be applied. However, merging may be difficult for workflows with complex network structure, a large
amount of data may need to be hidden, and more importantly, it may not be possible to merge at all
when the structure of the workflow is known.

Although privatization is a reasonable approach in some cases, there are many practical scenarios where
it cannot be employed, for instance when the workflow specification (the module names and connections)
is already known to the users, or when the identity of the privatized public module can be discovered through
the structure of the workflow and the names or types of its inputs/outputs.

To overcome this problem, we propose an alternative novel solution, based on the propagation of data
hiding through public modules. Returning to our example, if the input to m2 were hidden then the input to
m1 would also be hidden, although the user would still know that m1 was the equality function. Similarly, if
the output of m2 were hidden then the output of m3 would also be hidden; again, the user would still know
that m3 was the equality function. While in this example things appear to be simple, several technically
challenging issues must be addressed when employing such a propagation model in the general case: 1)
whether to propagate hiding upward (e.g. to m1) or downward (e.g. to m3); 2) how far to propagate data
hiding; and 3) which data of public modules must be hidden. Overall the goal is to guarantee that the
functionality of private modules is not revealed while minimizing the amount of hidden data.

In this paper we focus on downward propagation, for reasons that will be discussed in Section 3. Using
a downward propagation model, we show the following strong result for a special class of common
workflows, single (private)-predecessor workflows, or simply single-predecessor workflows (which include
the common tree and chain workflows). Take a solution for F-standalone-privacy of each private module
(a safe subset) and augment it with specially chosen input/output data of the public modules in its public closure
(up to a successor private module), so that the closure is rendered upstream-downstream safe (UD-safe) by the data
hiding. Hiding the union of the data in the augmented solutions of all private modules then ensures
F-workflow-privacy for all private modules. We define these notions formally in Section 3 and go on to show
that single-predecessor workflows form the largest class of workflows for which propagation of data hiding
only within the public closure suffices.

Since data may have different costs in terms of hiding, and there may be many different safe subsets 
for private modules and UD-safe subsets for public modules, the next problem we address is finding a 
minimum cost solution - the optimum view problem. Using the result from above, we show that for
single-predecessor workflows the optimum view problem may be solved by first identifying safe and UD-safe
subsets for the private and public modules, respectively, then assembling them together optimally. The
complexity of identifying safe subsets for a private module was studied in [15] and the problem was shown
to be NP-hard (EXP-time) in the number of module attributes. We show here that identifying UD-safe
subsets for public modules is of similar complexity: even deciding whether a given subset is UD-safe for
a module is coNP-hard in the number of input/output data. We note however that this is not as negative
as it might appear, since the number of inputs/outputs of individual modules is not high; furthermore, the
computation may be performed as a pre-processing step with the cost being amortized over possibly many
uses of the module in different workflows. In particular we show that, given the computed subsets, for chain
and tree-shaped workflows, the optimum view problem has a polynomial time solution in the size of the
workflow and the maximum number of safe/UD-safe subsets for a private/public module. Furthermore,
the algorithm can be applied to general single-predecessor workflows where the public closures have chain 
or tree shapes. In contrast, when the public closure has an arbitrary DAG shape, the problem becomes 
NP-hard (EXP-time) in the size of the public closure. 

We then consider general acyclic workflows, and give a sufficient condition to ensure F-privacy that is not 
the trivial solution of hiding all data in the workflow. In contrast to single-predecessor workflows, hiding 
data within a public closure no longer suffices; data hiding must continue through other private modules 
to the entire downstream workflow. In return, the requirement from data hiding for public modules is 
somewhat weaker here: hiding must only ensure that the module is downstream-safe (D-safe), which 
typically involves fewer input/output data than upstream-downstream-safety (UD-safe). 

The remainder of the paper is organized as follows: Our workflow model and notions of standalone-
and workflow-module privacy are given in Section 2. Section 3 describes our propagation model, defines
upstream-downstream-safety and single-predecessor workflows, and states the privacy theorem. Section @T|
discusses the proof of the privacy theorem, and the necessity of the upstream-downstream-safety
condition as well as the single-predecessor restriction. The optimization problem is studied in Section 4.2. We
then discuss general public/private workflows in Section l4~n before giving related work in Section 6 and
concluding in Section 7.



2 Preliminaries 

We start by reviewing the formal definitions and notions of module privacy from [15], and then extend them
to the context studied in this paper.¹ Readers familiar with the definitions and results in [15] can move
directly to Section 3.



2.1 Modules, Workflows and Relations 

Modules. A module m with a set I of input data and a set O of (computed) output data is modeled as a
relation R. R has the set of attributes A = I ∪ O, and satisfies the functional dependency I → O. We assume
that I ∩ O = ∅ and will refer to I and O as the input attributes and output attributes of R respectively.

We assume that the values of each attribute a ∈ A come from a finite but arbitrarily large domain Δ_a, and
let Dom = ∏_{a∈I} Δ_a and CoDom = ∏_{a∈O} Δ_a denote the domain and co-domain of the module m respectively.²
The relation R thus represents the (possibly partial) function m : Dom → CoDom and tuples in R describe
executions of m, namely for every t ∈ R, Π_O(t) = m(Π_I(t)). We overload the standard notation for projection,
Π_A(R), and use it for a tuple t ∈ R. Thus Π_A(t), for a set A of attributes, denotes the projection of t to the
attributes in A.



Workflows. A workflow W consists of a set of modules m_1, ..., m_n, connected as a DAG (see, for instance,
the workflow in Figure 1). We assume that (1) the output attributes of distinct modules are disjoint,
namely O_i ∩ O_j = ∅, for i ≠ j (i.e. each data item is produced by a unique module); and (2) whenever an
output of a module m_i is fed as input to a module m_j the corresponding output and input attributes of m_i and
m_j are the same. The DAG shape of the workflow guarantees that these requirements are not contradictory.

We model executions of W as a relation R over the set of attributes A = ∪_{i=1}^{n} A_i, satisfying the set of
functional dependencies F = {I_i → O_i : i ∈ [1,n]}. Each tuple in R describes an execution of the workflow
W. In particular, for every t ∈ R, and every i ∈ [1,n], Π_{O_i}(t) = m_i(Π_{I_i}(t)). One can think of R as containing
(possibly a subset of) the join of the individual module relations.

Example 1. Figure 1 shows a workflow involving three modules m1, m2, m3 with boolean input and output
attributes implementing the following functions: (i) m1 computes a3 = a1 ∨ a2, a4 = ¬(a1 ∧ a2) and a5 =
¬(a1 ⊕ a2), where ⊕ denotes XOR; (ii) m2 computes a6 = ¬(a3 + a4); and (iii) m3 computes a7 = a4 ∧ a5. The
relational representation (functionality) R1 of module m1 with the functional dependency a1 a2 → a3 a4 a5
is shown in Figure 1a. For clarity, we have added I (input) and O (output) above the attribute names to
indicate their role. The relation R describing the workflow executions is shown in Figure 1b, which has the
functional dependencies a1 a2 → a3 a4 a5, a3 a4 → a6, a4 a5 → a7 from modules m1, m2, m3 respectively.

Data sharing refers to an output attribute of a module acting as input to more than one module (hence
I_i ∩ I_j ≠ ∅ for i ≠ j). In the example above, attribute a4 is shared by both m2 and m3.

¹ The example in this section is also taken from [15].

² We distinguish between the possible range O of the function m, which we call the co-domain, and the actual range {y : ∃x ∈ I, y = m(x)}.

[Figure 1: Module and workflow executions as relations, and view. Panel (b): R, workflow executions.]



2.2 Module Privacy 

We consider the privacy of a single module, which is called standalone module privacy, then privacy of 
modules when they are connected in a workflow, which is called workflow module privacy. We study this 
given two types of modules, private modules (the focus of [15]) and public modules (the focus here). 



Standalone module privacy. Our approach to ensuring standalone module privacy, for a module
represented by the relation R, is to hide a carefully chosen subset H of R's attributes (called hidden attributes). In
other words, we project R on a restricted subset A \ H, where A is the set of all attributes of m. The set A \ H
is called the visible attributes. The users are allowed access only to the view R' = Π_{A\H}(R).

One may distinguish two types of modules. (1) Public modules whose behavior is fully known to users. 
Here users have a priori knowledge about the full content of R and, even if given only the view R', they are
able to fully (and exactly) reconstruct R. Examples include reformatting or sorting modules. (2) Private 
modules where such a priori knowledge does not exist. Here, the only information available to users, on the 
module's behavior, is the one given by R' . Examples include proprietary software, e.g. a genetic disorder 
susceptibility module. 

Given a view (projected relation) R' of a private module m, the possible worlds of m are all the possible
full relations (over the same schema as R) that are consistent with the view R'. Formally,

Definition 1. Let m be a private module with a corresponding relation R, having input and output attributes
I and O respectively. Let A = I ∪ O be the set of all attributes. Given a set of hidden attributes H, the set of
possible worlds for R with respect to H, denoted Worlds(R,H), consists of all relations R' over the same
schema as R that satisfy the functional dependency I → O, and where Π_{A\H}(R') = Π_{A\H}(R).

To guarantee privacy of a module m, the view R' should ensure some level of uncertainty with respect
to the value of the output m(Π_I(t)), for tuples t ∈ R. To define this, we introduce the notion of
F-standalone-privacy, for a given parameter F ≥ 1. Informally, a view R' is F-standalone-private if for every t ∈ R,
Worlds(R,H) contains at least F distinct output values that could be the result of m(Π_I(t)).

Definition 2. Let m be a private module with a corresponding relation R having input and output attributes
I and O resp. Then m is F-standalone-private with respect to a set of hidden attributes H, if for every tuple
x ∈ Π_I(R), |OUT_{x,m,H}| ≥ F, where OUT_{x,m,H} = {y | ∃R' ∈ Worlds(R,H), ∃t' ∈ R' s.t. x = Π_I(t') ∧ y = Π_O(t')}.³
If m is F-standalone-private with respect to hidden attributes H, then we call H a safe subset for m and F.

³ In [15], we (equivalently) defined privacy with respect to visible attributes V instead of hidden attributes H, and we used the
notation "OUT_{x,m} with respect to V" instead of OUT_{x,m,H}.

A module cannot be differentiated from its possible worlds with respect to the visible attributes, and 
therefore, whether the original module, or one from its possible worlds is being used cannot be recognized. 
Hence, F-standalone-privacy implies that for any input the adversary cannot guess m's output with
probability > 1/F, even if the module is executed an arbitrary number of times.

Example 2. Returning to module m1, suppose the hidden attributes are H = {a2, a4}, resulting in the view
R' in Figure 1c. For clarity, we have added I \ H (visible input) and O \ H (visible output) above the attribute
names to indicate their role. Naturally, R1 ∈ Worlds(R1,H), and we can check that overall there are 64
relations in Worlds(R1,H).

Furthermore, it can be verified that, if H = {a2, a4}, then for all x ∈ Π_I(R1), |OUT_{x,m1,H}| ≥ 4, so
H = {a2, a4} is safe for m1 and F = 4. As an example, when x = (0,0), OUT_{x,m1,H} ⊇ {(0,0,1), (0,1,1),
(1,0,0), (1,1,0)} (the middle component, a4, is hidden); we can define four possible worlds that map (0,0)
to these outputs (see [15] for details). Also, hiding any two output attributes from O = {a3, a4, a5}
ensures standalone privacy for F = 4, e.g. if H = {a4, a5}, then the input (0,0) can be mapped to one of
(0,0,0), (0,0,1), (0,1,0) and (0,1,1); this holds for other assignments of input attributes as well.
However, H = {a1, a2} (input attributes) is not safe for F = 4: for any input x, OUT_{x,m1,H} = {(0,1,1), (1,1,0),
(1,0,1)}, containing only three possible output tuples.
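
The numbers in this example can be checked by brute force. The following is a minimal sketch, assuming the definition of m1 from Example 1 and representing a relation as a set of 5-tuples (a1, ..., a5); the helper names `project`, `worlds`, `out_sets` and `privacy_level` are hypothetical and chosen only for this illustration.

```python
from itertools import product

ATTRS = ("a1", "a2", "a3", "a4", "a5")   # a1, a2 inputs; a3, a4, a5 outputs

def m1(a1, a2):
    """Module m1 of Example 1."""
    return (a1 | a2, 1 - (a1 & a2), 1 - (a1 ^ a2))

INPUTS = list(product((0, 1), repeat=2))
OUTPUTS = list(product((0, 1), repeat=3))
R1 = {x + m1(*x) for x in INPUTS}

def project(relation, hidden):
    """Projection of a relation on the visible attributes A \\ H."""
    keep = [i for i, a in enumerate(ATTRS) if a not in hidden]
    return {tuple(t[i] for i in keep) for t in relation}

def worlds(relation, hidden):
    """All relations satisfying I -> O whose visible projection matches."""
    target = project(relation, hidden)
    found = []
    # Each input is either absent (None) or mapped to exactly one output,
    # so the functional dependency holds by construction.
    for choice in product([None] + OUTPUTS, repeat=len(INPUTS)):
        world = {x + y for x, y in zip(INPUTS, choice) if y is not None}
        if project(world, hidden) == target:
            found.append(world)
    return found

def out_sets(relation, hidden):
    """OUT_{x,m,H} for every input x that occurs in the relation."""
    out = {t[:2]: set() for t in relation}
    for world in worlds(relation, hidden):
        for t in world:
            if t[:2] in out:
                out[t[:2]].add(t[2:])
    return out

def privacy_level(relation, hidden):
    """Largest F for which `hidden` is a safe subset."""
    return min(len(ys) for ys in out_sets(relation, hidden).values())
```

Under these assumptions, `len(worlds(R1, {"a2", "a4"}))` evaluates to 64 and `privacy_level(R1, {"a2", "a4"})` to 4, while `privacy_level(R1, {"a1", "a2"})` is only 3, in line with the example. The enumeration is exponential in the number of attributes, so it is only practical for small modules such as this one.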

Workflow module privacy. To define privacy in the context of a workflow, we first extend the notion of
possible worlds to a workflow view. Consider the view R' = Π_{A\H}(R) of the relation R of a workflow W, where
A is the set of all attributes across all modules in W. Since W may contain private as well as public modules,
a possible world for R' is a full relation that not only agrees with R' on the content of the visible attributes
and satisfies the functional dependency, but is also consistent with respect to the expected behavior of the
public modules. In the following definitions, m_1, ..., m_n are the modules in W and F = {I_i → O_i : 1 ≤ i ≤ n}
is the set of functional dependencies in R.

Definition 3. The set of possible worlds for the workflow relation R with respect to hidden attributes H
(denoted by Worlds(R,H)) consists of all relations R' over the same attributes as R that satisfy (1) the functional
dependencies in F, (2) Π_{A\H}(R') = Π_{A\H}(R), and (3) Π_{O_i}(t') = m_i(Π_{I_i}(t')) for every public module m_i in W and
every tuple t' ∈ R'.

We can now define the notion of F-workflow-privacy, for a given parameter F ≥ 1. Informally, a view
R' is F-workflow-private if for every tuple t ∈ R, and every private module m_i in the workflow, the possible
worlds Worlds(R,H) contain at least F distinct output values that could be the result of m_i(Π_{I_i}(t)).

Definition 4. A private module m_i in W is F-workflow-private with respect to a set of hidden attributes H,
if for every tuple x ∈ Π_{I_i}(R), |OUT_{x,W,H}| ≥ F, where OUT_{x,W,H} = {y | ∃R' ∈ Worlds(R,H), s.t., ∀t' ∈ R',
x = Π_{I_i}(t') ⇒ y = Π_{O_i}(t')}.

W is called F-private if every private module m_i in W is F-workflow-private. If W (resp. m_i) is F-private
(F-workflow-private) with respect to H, then we call H a safe subset for F-privacy of W (F-workflow-privacy
of m_i).

Similar to standalone module privacy, F-workflow-privacy ensures that for any input to a module m_i, the
output cannot be guessed with probability > 1/F even if m_i belongs to a workflow with arbitrary DAG structure
and interacts with other modules with known or unknown functionality, and even if the workflow is executed
an arbitrary number of times. For simplicity, the above definition assumes that the privacy requirement
of every module m_i is the same F. The results and proofs in this paper remain unchanged when different
modules m_i have different privacy requirements F_i. Note that there is a subtle difference between workflow privacy
of a module as defined above and standalone-privacy (Definition 2): the former uses the logical implication
operator (⇒) for defining OUT_{x,W,H} while the latter uses conjunction (∧) for defining OUT_{x,m,H}. This is due
to the fact that some modules are not onto,⁴ and as a result the input x itself may not appear in any execution
of the possible world R'. Nevertheless, there is an alternative definition of module m_i that maps x to y and
can be used in the workflow for R' consistently with the visible data.

2.3 Composability Theorem and Optimization 

Given a workflow W and parameter F, there may be several incomparable (in terms of set inclusion) safe 
subsets H for the (standalone) modules in W and for the workflow as a whole. Some of the corresponding 
R' views may be preferable to others, e.g. they provide users with more useful information, allow more 
common/critical user queries to be answered, etc. If cost(H) denotes the penalty of hiding the attributes in
H, a natural goal is to choose a safe subset H that minimizes cost(H). A particular instance of the problem
is when the cost function is additive: each attribute a has some penalty value cost(a) and the penalty of
hiding H is cost(H) = Σ_{a∈H} cost(a).

On the negative side, it was shown in [15] that the corresponding decision problem is hard in the number
of attributes, even for a single module and even in the presence of an oracle that tests whether a given
attribute subset is safe. On the positive side, however, it was shown that when the workflow consists only of
private modules (we call these "all-private" workflows), once privacy has been analyzed for the individual
modules, the results can be lifted to the whole workflow. In particular, the following theorem says that
hiding the union of hidden attributes of standalone-private solutions of the individual modules in an
all-private workflow guarantees F-workflow-privacy for all of them.

Theorem 1. (Composability Theorem for All-private Workflows [15]) Let W be a workflow consisting
only of private modules m_1, ..., m_n. For each i ∈ [1,n] let H_i ⊆ A_i be a set of safe hidden attributes for
F-standalone-privacy of m_i. Then the workflow W is F-private with respect to hidden attributes H = ∪_{i=1}^{n} H_i.

It was also observed in [15] that the number of attributes of individual modules can be much smaller than
the total number of attributes in a workflow, and that a proprietary module may be used in many different
workflows. Therefore, the obvious brute-force algorithm, which is essentially the best possible, can be used
(possibly as a pre-processing step) to find all standalone-private solutions of individual modules. Then any
set of "local solutions" for each module can be composed to give a global feasible solution. Moreover,
the composability theorem ensures that the private solutions are valid even with respect to future workflow
executions which have not yet been recorded in the workflow relation.
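
As a rough illustration of this brute-force pre-processing step, the sketch below enumerates every attribute subset of module m1 from Example 1, keeps those that are safe for a given privacy level, and picks the cheapest one under an additive cost. The function names (`privacy_level`, `safe_subsets`, `cheapest`, `compose`) and the cost values are assumptions made for the example; the composition step simply takes the union of local solutions, as Theorem 1 permits for all-private workflows.

```python
from itertools import combinations, product

ATTRS = ("a1", "a2", "a3", "a4", "a5")   # a1, a2 inputs; a3, a4, a5 outputs
INPUTS = list(product((0, 1), repeat=2))
OUTPUTS = list(product((0, 1), repeat=3))

def m1(a1, a2):
    """Module m1 of Example 1."""
    return (a1 | a2, 1 - (a1 & a2), 1 - (a1 ^ a2))

R1 = {x + m1(*x) for x in INPUTS}

def project(relation, hidden):
    keep = [i for i, a in enumerate(ATTRS) if a not in hidden]
    return {tuple(t[i] for i in keep) for t in relation}

def privacy_level(relation, hidden):
    """Minimum over inputs x of |OUT_x|, enumerating all possible worlds."""
    target = project(relation, hidden)
    out = {t[:2]: set() for t in relation}
    for choice in product([None] + OUTPUTS, repeat=len(INPUTS)):
        world = {x + y for x, y in zip(INPUTS, choice) if y is not None}
        if project(world, hidden) == target:
            for t in world:
                out[t[:2]].add(t[2:])
    return min(len(ys) for ys in out.values())

def safe_subsets(relation, level):
    """All hidden-attribute sets H that give at least the requested level."""
    subsets = (frozenset(h) for r in range(len(ATTRS) + 1)
               for h in combinations(ATTRS, r))
    return [h for h in subsets if privacy_level(relation, h) >= level]

def cheapest(subsets, cost):
    """Safe subset of minimum additive cost."""
    return min(subsets, key=lambda h: sum(cost[a] for a in h))

def compose(local_solutions):
    """Union of per-module solutions (valid for all-private workflows)."""
    return frozenset().union(*local_solutions)

safe = safe_subsets(R1, 4)
cost = {"a1": 10, "a2": 10, "a3": 1, "a4": 1, "a5": 1}   # assumed penalties
best = cheapest(safe, cost)
```

With these assumed costs the cheapest safe subset hides two output attributes. The search visits all 2^|A| subsets and, for each, all candidate worlds, which is why the text treats it as a per-module pre-processing step rather than a workflow-wide computation.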

Given Theorem 1, [15] focused on a modified optimization problem: combine standalone-private
solutions optimally to get a workflow-private solution. This optimization problem, which we refer to as the optimal
composition problem, remains NP-hard even in the simplest scenario, and therefore, [15] proposed efficient
approximation algorithms.

⁴ For a function f : D → C, D is the domain, C is the co-domain, and R = {y : ∃x ∈ D, f(x) = y} is the range. The function
f is onto if C = R.
3 Privacy via propagation



Workflows with both public and private modules are harder to handle than workflows with all private
modules. In particular, the composability theorem (Theorem 1) does not hold any more. To see why, we revisit
the example mentioned in the introduction.

Example 3. Consider a workflow with three modules m1, m2 and m3 as shown in Figure 2a. For simplicity,
assume that all modules have a boolean input and a boolean output, and implement the equality function
(i.e., a1 = a2 = a3 = a4). Module m2 is private, and the modules m1, m3 are public. When the private
module m2 is standalone, it can be verified that either hiding its input a2 or hiding its output a3 guarantees
F-standalone-privacy for F = 2. However, in the workflow, if a1 and a4 are visible then the actual values of
a2 and a3 can be found exactly since it is known that the public modules m1, m3 are equality modules.

One intuitive way to overcome this problem is to propagate the hiding of data through the problematic
public modules, i.e., to hide the attributes of public modules that may disclose information about hidden
attributes of private modules. To continue with the above example, if we choose to hide input a2 (respectively,
output a3) to protect the privacy of module m2, then we propagate the hiding upstream (resp. downstream)
to the public modules and hide the input attribute a1 of m1 (respectively, the output attribute a4 of m3).

The workflow in the above example has a simple structure, and the functionality of its component
modules is also simple. In general, three main issues arise when employing such a propagation model: (1)
upward vs. downward propagation; (2) repeated propagation; and (3) choosing which attributes to hide. We 
discuss these issues next. 

3.1 Upstream vs. Downstream propagation 

Which form of propagation can be used depends on the safe subsets chosen for the private modules as well
as properties of the public modules. To see this, consider again Example 3, and assume now that public
module m1 computes some constant function (e.g., m1(0) = m1(1) = 0). If input attribute a2 for module
m2 is hidden, then using upward propagation to hide the input attribute a1 of m1 does not preserve the
F-workflow-privacy of m2 for F > 1. This is because it suffices to look at the (visible) output attribute a3 = 0
of m2 to know that m2(0) = 0. In general, upward propagation from a subset of input attributes which gives
F1-standalone-privacy for a private module m will only yield F2-workflow-privacy for m, where F1 ≥ F2. It
is possible that F1 ≫ F2 unless upstream public modules are onto functions; in the worst case, if upstream
modules are constant functions, then F2 = 1 whereas F1 can be arbitrarily large. Unfortunately, it is not
common for modules to be onto functions (e.g. some output values may be well-known to be non-existent).

In contrast, when the privacy of a private module is achieved by hiding output attributes only, using 
downstream propagation it is possible to achieve the same privacy guarantee in the workflow as with the 
standalone case without imposing any restrictions on the public modules. Observe that safe subsets of 
output attributes always exist for all private modules - one can always hide all the output attributes. They 
may incur higher cost than that of an optimal subset of both input and output attributes, but, in terms of 
privacy, by hiding only output attributes one does not harm its maximum achievable privacy. In particular, 
it is not hard to see that hiding all input attributes can give a maximum of F1-workflow-privacy, where F1
is the size of the range of the module. On the other hand, hiding all output attributes can give a maximum
of F2-workflow-privacy, where F2 is the size of the co-domain of the module, which can be much larger
than the actual range. We therefore focus in the rest of this paper on safe subsets that contain only output 
attributes. 
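
The contrast between upward and downward propagation can be seen on the boolean chain of Example 3. The sketch below is an illustration under stated assumptions: tuples are (a1, a2, a3, a4), m2 is the equality function in the real workflow, and possible worlds are enumerated by trying every boolean function for m2 and every set of executed inputs while keeping the public modules fixed. The helper names are hypothetical. OUT sets follow Definition 4, including the implication in its condition.

```python
from itertools import product

ATTRS = ("a1", "a2", "a3", "a4")

def run(m1, m2, m3, inputs):
    """Workflow relation of the chain m1 -> m2 -> m3 on the given inputs."""
    rel = set()
    for a1 in inputs:
        a2 = m1(a1)
        a3 = m2(a2)
        rel.add((a1, a2, a3, m3(a3)))
    return rel

def project(rel, hidden):
    keep = [i for i, a in enumerate(ATTRS) if a not in hidden]
    return {tuple(t[i] for i in keep) for t in rel}

def out_m2(m1, m3, true_m2, hidden):
    """OUT_{x,W,H} of the private module m2 for each input x seen in R."""
    actual = run(m1, true_m2, m3, (0, 1))
    target = project(actual, hidden)
    out = {t[1]: set() for t in actual}
    for g0, g1 in product((0, 1), repeat=2):       # candidate tables for m2
        g = lambda v, table=(g0, g1): table[v]
        for inputs in ((), (0,), (1,), (0, 1)):     # candidate execution sets
            world = run(m1, g, m3, inputs)          # public modules stay fixed
            if project(world, hidden) != target:
                continue
            for x in out:
                for y in (0, 1):
                    # y qualifies if every tuple with input x has output y
                    if all(t[2] == y for t in world if t[1] == x):
                        out[x].add(y)
    return out

def level(out):
    return min(len(ys) for ys in out.values())

equality = lambda v: v
constant = lambda v: 0

up_onto = level(out_m2(equality, equality, equality, {"a1", "a2"}))
up_const = level(out_m2(constant, equality, equality, {"a1", "a2"}))
down_const = level(out_m2(constant, equality, equality, {"a3", "a4"}))
```

Here `up_onto` is 2, `up_const` drops to 1 because the visible output pins down m2(0), and `down_const` is again 2: hiding the output of m2 and propagating downward is unaffected by the constant upstream module.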
Figure 2: (a) Propagation model, (b) A single-predecessor workflow. White modules are public, grey are
private; the box denotes the composite module M for H2 = {as}.

3.2 Repeated Propagation 

Consider again Example 3, and assume now that public module m3 sends its output to another public module
m4 that implements an equality function (or a one-to-one invertible function). Even if the output of m3 is hidden
as described above, if the output of m4 remains visible, the privacy of m2 is again jeopardized since the output
of m3 can be inferred using the inverse function of m4. We thus need to propagate the attribute hiding to m4
as well. More generally, we need to propagate the attribute hiding repeatedly, through all adjacent public
modules, until we reach another private module.

To formally define the closure of public modules to which attribute hiding must be propagated, we use
the notion of a public path. Intuitively, there is a public path from a public module m_i to a public module
m_j if we can reach m_j from m_i by a path comprising only public modules. In what follows, we define both
directed and undirected public paths; recall that A_i = I_i ∪ O_i denotes the set of input and output attributes of
module m_i.

Definition 5. A public module m1 has a directed (resp. an undirected) public path to a public module m2
if there is a sequence of public modules m_{i_1}, m_{i_2}, ..., m_{i_j} such that m_{i_1} = m1, m_{i_j} = m2, and for all 1 ≤ k < j,
O_{i_k} ∩ I_{i_{k+1}} ≠ ∅ (resp. A_{i_k} ∩ A_{i_{k+1}} ≠ ∅).

This notion naturally extends to module attributes. We say that an input attribute a ∈ I_1 of a public
module m1 has an (un)directed public path to a public module m2 (and also to any output attribute b ∈ O_2),
if there is an (un)directed public path from m1 to m2. The set of public modules to which attribute hiding
will be propagated can now be defined as follows.

Definition 6. Given a private module $m_i$ and a set of hidden output attributes $h_i \subseteq O_i$ of $m_i$, the public-closure $C(h_i)$ of $m_i$ with respect to $h_i$ is the set of public modules reachable from some attribute in $h_i$ by an undirected public path.
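The closure in Definition 6 can be computed by a breadth-first search over public modules. The following is a minimal sketch, not the authors' implementation: the function name `public_closure`, the dictionary encoding of modules, and the four-module workflow are all hypothetical and chosen only for illustration.

```python
from collections import deque

def public_closure(hidden_outputs, public_modules):
    """Public modules reachable from `hidden_outputs` by an undirected public path.

    public_modules maps a module name to (inputs, outputs), two sets of
    attribute names. Private modules are simply left out of the dictionary.
    """
    attrs = {m: set(i) | set(o) for m, (i, o) in public_modules.items()}
    # Seeds: public modules that directly read one of the hidden output attributes.
    queue = deque(m for m, (i, _) in public_modules.items()
                  if set(i) & set(hidden_outputs))
    closure = set(queue)
    while queue:
        current = queue.popleft()
        for other, other_attrs in attrs.items():
            # Undirected step: two public modules that share an attribute.
            if other not in closure and attrs[current] & other_attrs:
                closure.add(other)
                queue.append(other)
    return closure

# Hypothetical workflow: a private module produces a2 and a3.
public = {
    "m3": ({"a2"}, {"a4"}),
    "m4": ({"a4", "a8"}, {"a5"}),
    "m5": ({"a3"}, {"a6"}),
    "m6": ({"a7"}, {"a8"}),
}
closure_a2 = public_closure({"a2"}, public)
closure_a3 = public_closure({"a3"}, public)
```

In this toy workflow `m6` is not downstream of `a2`, yet it lands in the closure of `{a2}` because it shares the attribute `a8` with `m4`; this is the effect of using undirected rather than directed public paths.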

Example 4. We illustrate these notions using Figure 2b. The public module $m_4$ has an undirected public path to the public module $m_6$ through the modules $m_7$ and $m_3$. For private module $m_2$, if the hidden output attributes are $h_2 = \{a_2\}$, $\{a_3\}$, or $\{a_2, a_3\}$, the public closure is $C(h_2) = \{m_3, m_4, m_6, m_7\}$. For $h_2 = \{a_4\}$, $C(h_2) = \{m_5, m_8\}$. In our subsequent analysis, it will be convenient to view the public-closure as a virtual composite module that encapsulates the sub-workflow and behaves like it. For instance, the box in Figure 2b denotes the composite module M representing $C(\{a_2\})$, which has input attributes $a_2, a_3$ and output attributes $a_{10}, a_{11}$ and $a_{12}$.

3.3 Selection of hidden attributes 

In Example 3 it is fairly easy to see which attributes of $m_3$ or $m_4$ need to be hidden to preserve the privacy of $m_2$. For the general case, where the public modules are not as simple as equality functions, we use the notions of upstream and downstream safety to determine which attributes of a given public module need to be hidden. To define them we use the following notion of tuple equivalence with respect to a given set of hidden attributes. Recall that A denotes the set of all attributes in the workflow; we also use bold-faced letters x, y, z, etc. to denote tuples in the workflow or module relations with one or more attributes.

Definition 7. Given two tuples x and y on a subset of attributes $B \subseteq A$, and a subset of hidden attributes $H \subseteq A$, we say that $x \equiv_H y$ iff $\pi_{B \setminus H}(x) = \pi_{B \setminus H}(y)$.

Definition 8. Given a subset of hidden attributes $H \subseteq A_i$ of a public module $m_i$, $m_i$ is called

• downstream-safe (or D-safe in short) with respect to H if for any two equivalent input tuples $x, x'$ to $m_i$ with respect to H, their outputs are also equivalent:

$$[x \equiv_H x'] \Rightarrow [m_i(x) \equiv_H m_i(x')],$$

• upstream-safe (or U-safe in short) with respect to H if for any two equivalent outputs $y, y'$ of $m_i$ with respect to H, all of their preimages are also equivalent:

$$[(y \equiv_H y') \wedge (m_i(x) = y,\ m_i(x') = y')] \Rightarrow [x \equiv_H x'],$$

• upstream-downstream-safe (or UD-safe in short) with respect to H if it is both U-safe and D-safe.

Note that if $H = A_i$ (i.e. all attributes are hidden) then $m_i$ is clearly UD-safe with respect to H. We call this the trivial UD-safe subset for $m_i$.
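Definition 8 can be checked directly on a finite module relation by comparing every pair of executions. The sketch below assumes a module is given as a Python dictionary from input tuples to output tuples; the helper names and the identity relation `R1` on two Boolean attributes are illustrative assumptions, not taken from the paper's figures.

```python
from itertools import combinations

def visible(row, attrs, hidden):
    """Projection of a tuple onto its non-hidden attributes (Definition 7)."""
    return tuple(v for a, v in zip(attrs, row) if a not in hidden)

def is_d_safe(relation, in_attrs, out_attrs, hidden):
    # Equivalent inputs must produce equivalent outputs.
    for (x, y), (x2, y2) in combinations(relation.items(), 2):
        if (visible(x, in_attrs, hidden) == visible(x2, in_attrs, hidden)
                and visible(y, out_attrs, hidden) != visible(y2, out_attrs, hidden)):
            return False
    return True

def is_u_safe(relation, in_attrs, out_attrs, hidden):
    # Equivalent outputs must come from equivalent inputs.
    for (x, y), (x2, y2) in combinations(relation.items(), 2):
        if (visible(y, out_attrs, hidden) == visible(y2, out_attrs, hidden)
                and visible(x, in_attrs, hidden) != visible(x2, in_attrs, hidden)):
            return False
    return True

def is_ud_safe(relation, in_attrs, out_attrs, hidden):
    return (is_d_safe(relation, in_attrs, out_attrs, hidden)
            and is_u_safe(relation, in_attrs, out_attrs, hidden))

# Assumed example: identity module with a3 = a1 and a4 = a2 over {0, 1}.
IN, OUT = ("a1", "a2"), ("a3", "a4")
R1 = {(b1, b2): (b1, b2) for b1 in (0, 1) for b2 in (0, 1)}
```

For this identity relation, hiding a matching input/output pair such as `{"a1", "a3"}` is UD-safe, while hiding the mismatched pair `{"a1", "a4"}` is not, because tuples that agree on the visible `a2` still differ on the visible `a3`.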

Example 5. Figure 3 shows some example module relations. For an (identity) module having relation $R_1$ in Figure 3a, the hidden subsets $\{a_1, a_3\}$ and $\{a_2, a_4\}$ are UD-safe. Note that $H = \{a_1, a_4\}$ is not a UD-safe subset: for tuples having the same value of the visible attribute $a_2$, say 0, the values of $a_3$ are not the same. For a module having relation $R_2$ in Figure 3b, a UD-safe hidden subset is $\{a_2\}$, but there is no UD-safe subset that does not include $a_2$. It can also be checked that the module $m_1$ in Figure 1a does not have any non-trivial UD-safe subset.

The first question we attempt to answer is whether there is a composability theorem analogous to Theorem 1 that works in the presence of public modules. In particular, we will show that for a class of workflows called single-predecessor workflows one can construct a private solution for the whole workflow by taking safe standalone solutions for the private modules, and then ensuring the UD-safe properties of the public modules in the corresponding public-closure. Next we define this class of workflows:
Figure 3: UD-safe solutions for modules

Definition 9. A workflow W is called a single-predecessor workflow, if

1. W has no data-sharing, i.e. for $m_i \neq m_j$, $I_i \cap I_j = \emptyset$, and,

2. for every public module $m_j$ that belongs to a public-closure with respect to some output attribute(s) of a private module $m_i$, $m_i$ is the only private module that has a directed public path to $m_j$ (i.e. $m_i$ is the single private predecessor of $m_j$).

Example 6. Again consider Figure 2b, which shows a single-predecessor workflow. Modules $m_3, m_4, m_6, m_7$ have undirected public paths from $a_2 \in O_2$ (an output attribute of $m_2$), whereas $m_5, m_8$ have an undirected (also directed) public path from $a_4 \in O_2$; also $m_2$ is the single private predecessor of $m_3, \ldots, m_8$ that has a directed path to each of these modules. The public module $m_1$ does not have any private predecessor, but $m_1$ does not belong to the public-closure with respect to the output attributes of any private module.

Although single-predecessor workflows are more restrictive than general workflows, the above example illustrates that they can still capture fairly intricate workflow structures, and more importantly, they can capture commonly found chain and tree workflows [3]. Next, in Section 4, we focus on single-predecessor workflows; then we explain in Section 5 how general workflows can be handled.



4 Single-Predecessor Workflows 

The main motivation behind the study of single-predecessor workflows is to obtain a composability theorem similar to Theorem 1, combining solutions of standalone private and public modules. In Section 4.1 we show that such a composability theorem indeed exists for this class of workflows. Then we study how to optimally compose the standalone solutions in Section 4.2.



4.1 Composability Theorem for Privacy 

The following composability theorem says that, for each private module $m_i$, it suffices to (i) find a safe hidden subset of output attributes (downstream propagation), (ii) find a superset of these hidden attributes such that each public module in their public closure is UD-safe, and (iii) ensure that no attributes outside the public closure and $m_i$ are hidden (i.e. no unnecessary hiding). Then the union of these subsets of hidden attributes is workflow-private for each private module in the workflow. Theorem 2, stated below, formalizes these three conditions.

Theorem 2. (Composability Theorem for Single-predecessor Workflows) Let W be a single-predecessor workflow. For each private module $m_i$ in W, let $H_i$ be a subset of hidden attributes such that (i) $h_i = H_i \cap O_i$ is safe for $\Gamma$-standalone-privacy of $m_i$, (ii) each public module $m_j$ in the public-closure $C(h_i)$ is UD-safe with respect to $A_j \cap H_i$, and (iii) $H_i \subseteq O_i \cup \bigcup_{j: m_j \in C(h_i)} A_j$. Then the workflow W is $\Gamma$-private with respect to $H = \bigcup_{i: m_i \text{ is private}} H_i$.

First, in Section 4.1.1, we argue why the conditions and assumptions in the above theorem are necessary; then we prove the theorem in Section 4.1.2.



4.1.1 Necessity of the Assumptions in Theorem 2

Theorem 2 has two non-trivial conditions: (1) the workflows are single-predecessor workflows, and (2) the public modules in the public closure must be UD-safe with respect to the hidden subset. The third condition, that there is no unnecessary data hiding, is required since the UD-safety property of public modules is not preserved under set inclusion. The necessity of the first two conditions is discussed in Propositions 1 and 2 respectively.

In the proofs of these propositions we will consider the different possible worlds of the workflow view and focus on the behavior (input-to-output mapping) $\hat{m}_i$ of the module $m_i$ as seen in these worlds. This may be different from its true behavior recorded in the actual workflow relation R, and we will say that $m_i$ is redefined as $\hat{m}_i$ in the given world. Note that $m_i$ and $\hat{m}_i$, viewed as relations, agree on the visible attributes of the view but may differ in the non-visible ones.



Necessity of Single-Predecessor Workflows. The next proposition shows that single-predecessor workflows constitute the largest class of workflows for which a composability theorem involving both public and private modules can succeed.

Proposition 1. There is a workflow W, which is not a single-predecessor workflow, and a private module $m_i$ in W, where even hiding all output attributes of $m_i$ and all attributes of all the public modules in W does not give $\Gamma$-privacy for any $\Gamma > 1$.

Proof. By Definition 9, a workflow W is not a single-predecessor workflow if one of the following holds: (i) there is a public module $m_j$ in W that belongs to a public-closure of a private module $m_i$ but has no directed path from $m_i$, or (ii) such a public module $m_j$ has a directed path from more than one private module, or (iii) W has data sharing. We now show an example for condition (i). Examples for the remaining conditions can be found in Appendix A.1.

Consider the workflow $W_a$ in Figure 4a. Here the public module $m_2$ belongs to the public-closure $C(\{a_3\})$ of $m_1$, but there is no directed public path from $m_1$ to $m_2$, thereby violating the condition of single-predecessor workflows (though there is no data sharing). Module functionality is as follows: (i) $m_1$ takes $a_1$ as input and produces $a_3 = m_1(a_1) = a_1$. (ii) $m_2$ takes $a_2$ as input and produces $a_4 = m_2(a_2) = a_2$. (iii) $m_3$ takes $a_3, a_4$ as input and produces $a_5 = m_3(a_3, a_4) = a_3 \vee a_4$ (OR). (iv) $m_4$ takes $a_5$ as input and produces $a_6 = m_4(a_5) = a_5$. All attributes take values in $\{0, 1\}$.

Clearly, hiding output $\{a_3\}$ of $m_1$ gives 2-standalone privacy. We claim that hiding all output attributes of $m_1$ and all attributes of all public modules (i.e. $\{a_2, a_3, a_4, a_5\}$) gives only trivial 1-workflow-privacy for $m_1$, although it satisfies the UD-safe condition of $m_2, m_3$. To see this, consider the relation $R_a$ of all executions of $W_a$ given in Table 1, where the hidden values are in grey. The rows (tuples) here are numbered $r_1, \ldots, r_4$ for later reference.

When $a_3$ is hidden, a possible candidate output of input $a_1 = 0$ to $m_1$ is 1. So we need to have a possible world where $m_1$ is redefined as $\hat{m}_1(0) = 1$. This would restrict $a_3$ to 1 whenever $a_1 = 0$. But note that whenever $a_3 = 1$, $a_5 = 1$, irrespective of the value of $a_4$ ($m_3$ is an OR function).
Table 1: Relation $R_a$ for workflow $W_a$ given in Figure 4a.

Figure 4: Necessity of the conditions in Theorem 2: (a) single-predecessor workflows, (b) UD-safety for public modules. White modules are public, grey are private.



This affects the rows $r_1$ and $r_2$ in $R_a$. Both these rows must have $a_5 = 1$; however, $r_1$ has $a_6 = 0$, and $r_2$ has $a_6 = 1$. But this is impossible since, whatever the new definition $\hat{m}_4$ of private module $m_4$ is, it cannot map $a_5$ to both 0 and 1; $\hat{m}_4$ must be a function and maintain the functional dependency $a_5 \rightarrow a_6$. Hence all possible worlds of $R_a$ must map $m_1(0)$ to 0, and therefore $\Gamma = 1$. □
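The argument can be replayed by brute force on this four-module workflow. The sketch below is a simplification under stated assumptions: a possible world is taken to be any redefinition of the private modules $m_1$ and $m_4$ (with the public $m_2$ and $m_3$ kept fixed) whose set of rows agrees with the original relation on the visible attributes $a_1$ and $a_6$. All function and variable names are illustrative.

```python
from itertools import product

BITS = (0, 1)
ATTRS = ("a1", "a2", "a3", "a4", "a5", "a6")

def run(m1, m4):
    """All executions of W_a for given private m1, m4; m2 and m3 are public and fixed."""
    rows = []
    for a1, a2 in product(BITS, repeat=2):
        a3 = m1[a1]
        a4 = a2          # public m2: identity
        a5 = a3 | a4     # public m3: OR
        rows.append((a1, a2, a3, a4, a5, m4[a5]))
    return rows

def view(rows, hidden):
    """Set of row projections onto the visible attributes."""
    keep = [i for i, a in enumerate(ATTRS) if a not in hidden]
    return {tuple(r[i] for i in keep) for r in rows}

hidden = {"a2", "a3", "a4", "a5"}
identity = {0: 0, 1: 1}
target = view(run(identity, identity), hidden)

# Collect every value that some possible world assigns to m1(0).
candidates = set()
for m1_vals, m4_vals in product(product(BITS, repeat=2), repeat=2):
    m1, m4 = dict(zip(BITS, m1_vals)), dict(zip(BITS, m4_vals))
    if view(run(m1, m4), hidden) == target:
        candidates.add(m1[0])
```

Under these assumptions the only surviving candidate output for input 0 is 0, matching the conclusion of the proof.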

Necessity of UD-safety for public modules. Example 3 in the previous section motivated why the downstream-safety condition is necessary and natural. The following proposition illustrates the need for the additional upstream-safety condition in Theorem 2, even when we consider downstream propagation.

Proposition 2. There is a workflow W with a private module $m_i$, and a safe subset of hidden attributes $h_i$ guaranteeing $\Gamma$-standalone-privacy for $m_i$ ($\Gamma > 1$), such that satisfying only the downstream-safety condition for the public modules in $C(h_i)$ does not give $\Gamma$-workflow-privacy for $m_i$ for any $\Gamma > 1$.

Proof. Consider the chain workflow $W_b$ given in Figure 4b with three modules $m_1, m_2, m_3$ defined as follows: (i) $(a_3, a_4) = m_1(a_1, a_2)$ where $a_3 = a_1$ and $a_4 = a_2$, (ii) $a_5 = m_2(a_3, a_4) = a_3 \vee a_4$ (OR), (iii) $a_6 = m_3(a_5) = a_5$. $m_1, m_3$ are private whereas $m_2$ is public. All attributes take values in $\{0, 1\}$. Clearly hiding output $a_3$ of $m_1$ gives $\Gamma$-standalone privacy for $\Gamma = 2$. Now suppose $a_3$ is hidden in the workflow. Since $m_2$ is public (known to be the OR function), $a_5$ must be hidden (downstream-safety condition). Otherwise, from the visible output $a_5$ and input $a_4$, some values of the hidden input $a_3$ can be uniquely determined (e.g. if $a_5 = 0, a_4 = 0$, then $a_3 = 0$, and if $a_5 = 1, a_4 = 0$, then $a_3 = 1$). On attributes $(a_1, a_2, a_3, a_4, a_5, a_6)$, the original relation R is shown in Table 2 (the hidden attributes and their values are underlined in the text and in grey in the table).

Table 2: Relation R for the workflow $W_b$ given in Figure 4b.

Let us first consider an input $(0,0)$ to $m_1$. When $a_3$ is hidden, a possible candidate output y of input tuple $x = (0,0)$ to $m_1$ is $(\underline{1}, 0)$. So we need to have a possible world where $m_1$ is redefined as $\hat{m}_1(0,0) = (\underline{1}, 0)$. To be consistent on the visible attributes, this forces us to redefine $m_3$ to $\hat{m}_3$, where $\hat{m}_3(1) = 0$; otherwise the row $(0,0,\underline{0},0,\underline{0},0)$ in R changes to $(0,0,\underline{1},0,\underline{1},1)$. This in turn forces us to define $\hat{m}_1(1,0) = (\underline{0}, 0)$ and $\hat{m}_3(0) = 1$. (This is because if we map $\hat{m}_1(1,0)$ to any of $\{(1,0), (0,1), (1,1)\}$, either we have an inconsistency on the visible attribute $a_4$, or $a_5 = 1$ and $\hat{m}_3(1) = 0$, which gives a contradiction on the visible attribute $a_6 = 1$.)

Now consider the input $(1,1)$ to $m_1$. For the sake of consistency on the visible attribute $a_4$, $\hat{m}_1(1,1)$ can take value $(1,1)$ or $(0,1)$. But if $\hat{m}_1(1,1) = (1,1)$ or $(0,1)$, we have an inconsistency on the visible attribute $a_6$. For this input in the original relation R, $a_5 = a_6 = 1$. Due to the redefinition $\hat{m}_3(1) = 0$, we have an inconsistency on $a_6$. But note that the downstream-safety condition has been satisfied so far by hiding $a_3$ and $a_5$. To have consistency on the visible attribute $a_6$ in the row $(1,1,\underline{1},1,\underline{1},1)$, we must have $a_5 = 0$ (since $\hat{m}_3(0) = 1$). The pre-image of $a_5 = 0$ is $a_3 = 0, a_4 = 0$, hence we have to redefine $\hat{m}_1(1,1) = (0,0)$. But $(0,0)$ is not equivalent to the original $m_1(1,1) = (1,1)$ with respect to the visible attribute $a_4$. So the only solution in this case for $\Gamma > 1$, assuming that we do not hide output $a_6$ of private module $m_3$, is to hide $a_4$, which makes the public module $m_2$ both upstream-safe and downstream-safe. □

This example also suggests that upstream-safety is needed only when a private module gets input from a module in the public-closure. We will see later in the proof of Lemma 1 (Section 4.1.2) that this is indeed the case.
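The chain example of Proposition 2 is small enough to check exhaustively. The following sketch makes the same simplifying assumption as before: a possible world is any redefinition of the private modules $m_1$ and $m_3$, with the public OR module $m_2$ fixed, whose rows agree with the original relation on the visible attributes. The helper names are hypothetical.

```python
from itertools import product

BITS = (0, 1)
ATTRS = ("a1", "a2", "a3", "a4", "a5", "a6")
INPUTS = list(product(BITS, repeat=2))

def run(m1, m3):
    """All executions of the chain m1 -> m2 (public OR) -> m3."""
    rows = []
    for x in INPUTS:
        a3, a4 = m1[x]
        a5 = a3 | a4
        rows.append(x + (a3, a4, a5, m3[a5]))
    return rows

def view(rows, hidden):
    keep = [i for i, a in enumerate(ATTRS) if a not in hidden]
    return {tuple(r[i] for i in keep) for r in rows}

def candidate_outputs(x, hidden):
    """Outputs that some possible world assigns to input x of m1."""
    target = view(run({u: u for u in INPUTS}, {0: 0, 1: 1}), hidden)
    found = set()
    for outs in product(INPUTS, repeat=len(INPUTS)):   # every redefinition of m1
        m1 = dict(zip(INPUTS, outs))
        for m3_vals in product(BITS, repeat=2):        # every redefinition of m3
            if view(run(m1, dict(zip(BITS, m3_vals))), hidden) == target:
                found.add(m1[x])
    return found

downstream_only = candidate_outputs((0, 0), {"a3", "a5"})
upstream_too = candidate_outputs((0, 0), {"a3", "a4", "a5"})
```

With only $a_3$ and $a_5$ hidden, input $(0,0)$ keeps a single candidate output; once $a_4$ is hidden as well, several candidates appear, including the output $(1,0)$ discussed in the proof.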

4.1.2 Proof of Composability Theorem 

To prove $\Gamma$-privacy, we need to show the existence of at least $\Gamma$ possible outputs for each input to each private module, originating from the possible worlds of the workflow relation with respect to the visible attributes. First we present a crucial lemma, which shows the existence of many possible outputs for any fixed input to any fixed private module in the workflow, when the conditions in Theorem 2 are satisfied. In particular, this lemma shows that any candidate output for a given input for standalone privacy remains a candidate output for workflow-privacy, even when the private module interacts with other private and public modules in a (single-predecessor) workflow. Therefore, if there are $\ge \Gamma$ candidate outputs for standalone-privacy, there will be $\ge \Gamma$ candidate outputs for workflow-privacy. Later in this section we will formally prove Theorem 2 using this lemma.

Lemma 1. Consider a standalone private module $m_i$, a set of hidden attributes $h_i$, any input x to $m_i$, and any candidate output $y \in \mathrm{OUT}_{x,m_i,h_i}$ of x. Then $y \in \mathrm{OUT}_{x,W,H_i}$ when $m_i$ belongs to a single-predecessor workflow W, and a set of attributes $H_i \subseteq A$ is hidden such that (i) $h_i \subseteq H_i$, (ii) only output attributes from $O_i$ are included in $h_i$ (i.e. $h_i \subseteq O_i$), and (iii) every module $m_j$ in the public-closure $C(h_i)$ is UD-safe with respect to $A_j \cap H_i$.

Figure 5: Illustration of Example 7. Input-output relationship in (a) the original workflow (original modules $m_1, m_3$), (b) a possible world mapping x to y (redefined $\hat{m}_1, \hat{m}_3$).

To prove the lemma, we (arbitrarily) fix a private module $m_i$, an input x to $m_i$, a hidden subset $h_i$, and a candidate output $y \in \mathrm{OUT}_{x,m_i,h_i}$ for x. The proof comprises two steps:

(Step-1) Consider the connected subgraph $C(h_i)$ as a single composite public module M, or equivalently assume that $C(h_i)$ contains a single public module. By the properties of single-predecessor workflows, M gets all its inputs from $m_i$, but can send its outputs to one, multiple, or zero (for final output) private modules. Let I (respectively O) be the input (respectively output) attribute set of M. In Figure 2b, the box is M, $I = \{a_2, a_3\}$ and $O = \{a_{10}, a_{11}, a_{12}, a_{13}\}$. We argue that when M is UD-safe with respect to the hidden attributes $(I \cup O) \cap H_i$, and the other conditions of Lemma 1 are satisfied, then $y \in \mathrm{OUT}_{x,W,H_i}$.

(Step-2) We show that if every public module in the composite module $M = C(h_i)$ is UD-safe, then M is UD-safe. To continue with our example, in Figure 2b, assuming that $m_3, m_4, m_6, m_7$ are UD-safe with respect to the hidden attributes, we have to show that M is UD-safe.

Proof of Step-1. The proof of Lemma 1 is involved even for the restricted scenario in Step-1, in which $C(h_i)$ contains a single public module. Due to space constraints, the proof is given in Appendix A.2, and we illustrate here the key ideas using a simple example of a chain workflow.

Example 7. Consider a chain workflow, for instance, the one given in Figure 4b with the relation in Table 2. Fix module $m_i = m_1$. Hiding its output $h_1 = \{a_3\}$ gives $\Gamma$-standalone-privacy for $\Gamma = 2$. Fix input $x = (0,0)$, with original output $z = m_1(x) = (\underline{0}, 0)$ (the hidden attribute $a_3$ is underlined). Also fix a candidate output $y = (\underline{1}, 0) \in \mathrm{OUT}_{x,m_1,h_1}$. Note that y and z are equivalent on the visible attribute $\{a_4\}$.

First, consider the simpler case when $m_3$ does not exist, i.e. W contains only two modules $m_1, m_2$, and the column for $a_6$ does not exist in Table 2. As we mentioned before, when the composite public module does not have any private successor, we only need the downstream-safety property for modules in $C(h_i)$; in this case, $C(h_i)$ comprises a single public module, $m_2$. We construct a possible world $R'$ of R by redefining module $m_1$ to $\hat{m}_1$ as follows: $\hat{m}_1$ simply maps all pre-images of y to z, and all pre-images of z to y. In this case, both y and z have a single pre-image. So $x = (0,0)$ gets mapped to $(1,0)$ and input $(1,0)$ gets mapped to $(0,0)$. To make $m_2$ downstream-safe, we hide the output $a_5$ of $m_2$. Therefore, the set of hidden attributes is $H_1 = \{a_3, a_5\}$. Finally, $R'$ is formed by the join of the relations for $\hat{m}_1$ and $m_2$. Note that the projections of R and $R'$ will be the same on the visible attributes $a_1, a_2, a_4$ (in $R'$, the first row will be $(0,0,\underline{1},0,\underline{1})$ and the third row will be $(1,0,\underline{0},0,\underline{0})$).

Table 3: Relation $R'$, a possible world of the relation R for the workflow in Figure 4b with respect to $H_1 = \{a_3, a_4, a_5\}$.

Next consider the more complicated case, when the modules in $C(h_i)$ have private successors (in this example, when the private module $m_3$ is present). We already argued in the proof of Proposition 2 that we also need to hide the input $a_4$ to ensure workflow privacy for $\Gamma > 1$ (UD-safety). Let us now describe the proof strategy when $a_4$ is hidden, i.e. $H_1 = \{a_3, a_4, a_5\}$.

Let $w_y = m_2(y)$ and $w_z = m_2(z)$ (see Figure 5a). We redefine $m_1$ to $\hat{m}_1$ as follows (see Figure 5b). For all inputs u to $m_1$ such that $u \in m_1^{-1} m_2^{-1}(w_z)$ (respectively $u \in m_1^{-1} m_2^{-1}(w_y)$), we define $\hat{m}_1(u) = y$ (respectively $\hat{m}_1(u) = z$). Note that tuples u that are not necessarily in $m_1^{-1}(y)$ or $m_1^{-1}(z)$ are also being redefined under $\hat{m}_1$ (see Figure 5b). For $\hat{m}_3$, we define $\hat{m}_3(w_y) = m_3(w_z)$ and $\hat{m}_3(w_z) = m_3(w_y)$. Recall that $y \equiv_{H_1} z$ (y and z have the same values of visible attributes). Since $m_2$ is downstream-safe, $w_y \equiv_{H_1} w_z$. Since $m_2$ is also upstream-safe, for all inputs u to $m_1$ that are being redefined by $\hat{m}_1$, their images under $m_1$ are equivalent with respect to $H_1$ (and therefore with y and z). In our example, $w_y = m_2(1,0) = (1)$ and $w_z = m_2(0,0) = (0)$. Further, $m_1^{-1} m_2^{-1}(w_z) = \{(0,0)\}$ and $m_1^{-1} m_2^{-1}(w_y) = \{(0,1), (1,0), (1,1)\}$. So $\hat{m}_1$ maps $(0,0)$ to $(1,0)$ and all of $\{(0,1), (1,0), (1,1)\}$ to $(0,0)$; $\hat{m}_3$ maps $(0)$ to $(1)$ and $(1)$ to $(0)$.

Consider the relation $R'$ formed by joining the relations of $\hat{m}_1$, $m_2$, $\hat{m}_3$ (see Table 3). The relation $R'$ has the same projection on the visible attributes $\{a_1, a_2, a_6\}$ as R in Table 2, and the public module $m_2$ is unchanged. So $R'$ is a possible world of R that maps $x = (0,0)$ to $y = (1,0)$ as desired, i.e. $y \in \mathrm{OUT}_{x,W,H_1}$.

The argument for more general single-predecessor workflows, like the one given in Figure 2b, is more
complex. Here a private module (like win) can get inputs from m, (in Figure |2bJ /M2), from its public-closure 
C{hi) (in the figure, mg), and also from the private successors of the modules in C(/j,) (in the figure, mio). 
In this case, the tuples $w_y, w_z$ are not well-defined, and redefining the private modules is more complex. In the proof of the lemma we give the formal argument using an extended flipping function, which selectively changes part of the inputs and outputs of the private module based on their connection with the private module $m_i$ considered in the lemma.

Proof of Step-2. The following lemma formalizes the claim in Step-2: 

Lemma 2. Let M be a composite module consisting only of public modules. Let H be a subset of hidden attributes such that every public module $m_j$ in M is UD-safe with respect to $A_j \cap H$. Then M is UD-safe with respect to $(I \cup O) \cap H$.

Sketch. The formal proof of this lemma is given in Appendix A.3. We sketch here the main ideas. To prove the lemma, we show that if every module in the public-closure is downstream-safe (respectively upstream-safe), then M is downstream-safe (respectively upstream-safe). For downstream-safety, we consider the modules in M in topological order, say $m_{i_1}, \ldots, m_{i_k}$ (in Figure 2b, $k = 4$ and the modules in order may be $m_3, m_6, m_4, m_7$). Let $M^j$ be the (partial) composite public module formed by the union of modules $m_{i_1}, \ldots, m_{i_j}$, and let $I^j, O^j$ be its input and output (the attributes that are either from a module not in $M^j$ to a module in $M^j$, or to a module not in $M^j$ from a module in $M^j$). Clearly, $M^1 = \{m_{i_1}\}$ and $M^k = M$. Then by induction from $j = 1$ to $k$, we show that $M^j$ is downstream-safe with respect to $(I^j \cup O^j) \cap H$ if all of $m_{i_\ell}$, $1 \le \ell \le j$, are downstream-safe with respect to $(I_{i_\ell} \cup O_{i_\ell}) \cap H = A_{i_\ell} \cap H$. For upstream-safety, we consider the modules in reverse topological order, $m_{i_k}, \ldots, m_{i_1}$, and give a similar argument by an induction on $j = k$ down to 1. □

Proof of Theorem 2. Now we complete the proof of Theorem 2 using Lemma 1.

We first argue that if $H_i$ satisfies the conditions in Theorem 2, then $H_i' = \bigcup_{\ell: m_\ell \text{ is private}} H_\ell$ satisfies the conditions in Lemma 1. Since $h_i = H_i \cap O_i$, (i) $h_i \subseteq H_i \subseteq \bigcup_{\ell: m_\ell \text{ is private}} H_\ell = H_i'$, and (ii) $h_i \subseteq O_i$. Next we argue that the third requirement in the lemma, (iii) every module $m_j$ in the public-closure $C(h_i)$ is UD-safe with respect to $H_i' \cap A_j$, also holds.

To see (iii), observe that Theorem 2 has an additional condition on $H_i$: $H_i \subseteq O_i \cup \bigcup_{j: m_j \in C(h_i)} A_j$. Since W is a single-predecessor workflow, for two private modules $m_i, m_\ell$, the public closures satisfy $C(h_i) \cap C(h_\ell) = \emptyset$ (this follows directly from the definition of single-predecessor workflows). Further, since W is single-predecessor, W has no data-sharing by definition. So for any two modules $m_j, m_\ell$ in W (public or private), the set of attributes $A_j \cap A_\ell = \emptyset$. Clearly, when $m_i$ is a private module, $m_i \notin C(h_\ell)$ for any private module $m_\ell$ in W, by the definition of public-closure. Hence for any two private modules $m_i, m_\ell$,

$$\Big(O_i \cup \bigcup_{j: m_j \in C(h_i)} A_j\Big) \cap \Big(O_\ell \cup \bigcup_{j: m_j \in C(h_\ell)} A_j\Big) = \emptyset.$$

In particular, for two private modules $m_i \neq m_\ell$, $H_i \cap H_\ell = \emptyset$. Hence, for a public module $m_j \in C(h_i)$, and for any other private module $m_\ell$, $A_j \cap H_\ell = \emptyset$. Therefore, $A_j \cap H_i' = A_j \cap \big(\bigcup_{\ell: m_\ell \text{ is private}} H_\ell\big) = A_j \cap H_i$. Since $m_j$ is UD-safe with respect to $A_j \cap H_i$ from the condition in the theorem, $m_j$ is also UD-safe with respect to $A_j \cap H_i'$. This shows that $H_i'$ satisfies the conditions stated in the lemma.

Theorem 2 also states that each private module $m_i$ is $\Gamma$-standalone-private with respect to $h_i$, i.e., $|\mathrm{OUT}_{x,m_i,h_i}| \ge \Gamma$ for all inputs x to $m_i$ (see Definition 2). From Lemma 1, using $H_i'$ in place of $H_i$, this implies that for all inputs x to private modules $m_i$, $|\mathrm{OUT}_{x,W,H_i'}| \ge \Gamma$, where $H_i' = \bigcup_{\ell: m_\ell \text{ is private}} H_\ell$. From the definition of workflow-privacy, this implies that each private module $m_i$ is $\Gamma$-workflow-private with respect to $H_i'$, which is the same as H in Theorem 2. Since this is true for all private modules $m_i$ in W, W is $\Gamma$-private with respect to H. □

4.2 Optimal Composition for Single Predecessor Workflows 

Recall the optimal composition problem mentioned in Section 2.3. This problem focused on optimally combining the safe solutions for private modules in an all-private workflow in order to minimize the cost of hidden attributes. In this section, we consider optimal composition for a single-predecessor workflow W with private and public modules. Our goal is to find subsets $H_i$ for each private module $m_i$ in W satisfying the conditions given in Theorem 2 such that $\mathrm{cost}(H)$ is minimized for $H = \bigcup_{i: m_i \text{ is private}} H_i$. We solve this in four steps: (I) find the safe solutions for standalone-privacy for individual private modules; (II) find the UD-safe solutions for individual public modules; (III) find the optimal hidden subset $H_i$ for the public-closure of every private module $m_i$ using the outputs of the first two steps; and (IV) combine the $H_i$'s to find the final optimal solution H. We next consider each of these steps.

I. Private Solutions for Individual Private Modules. For each private module $m_i$ we compute the set of safe subsets $\mathbf{S}_i = \{S_{i1}, \ldots, S_{ip_i}\}$, where each $S_{i\ell} \subseteq O_i$ is standalone-private for $m_i$. Here $p_i$ is the number of safe subsets for $m_i$. Recall from Theorem 2 that the choice of safe subset for $m_i$ determines its public-closure (and consequently the possible $H_i$ sets and the cost of the overall solution). It is thus not sufficient to consider only the safe subsets that have the minimum cost; we need to keep all safe subsets for $m_i$, to be examined by subsequent steps.

The complexity of finding safe subsets for individual private modules has been thoroughly studied in [15] under the name standalone Secure-View problem. It was shown that deciding whether a given hidden subset of attributes is safe for a private module is NP-hard in the number of attributes of the module. It was further shown that the set of all safe subsets for the module can be computed in time exponential in the number of attributes assuming constant domain size, which almost matches the lower bounds.

Although the lower and upper bounds are somewhat disappointing, as argued in [15], the number of attributes of an individual module is fairly small. The assumption of constant domain is reasonable for
practical purposes, assuming that the integers and reals are represented in a fixed number of bits. In these 
cases the individual relations can be big, however this computation can be done only once as a pre-processing 
step and the cost can be amortized over possibly many uses of the module in different workflows. Expert 
knowledge (from the module designer) can also be used to help find the safe subsets. 

II. Safe Solutions for Individual Public Modules. This step focuses on finding the set of all UD-safe solutions for the individual public modules. We denote the UD-safe solutions for a public module $m_j$ by $\mathbf{U}_j = \{U_{j1}, \ldots, U_{jp_j}\}$, where each UD-safe subset $U_{j\ell} \subseteq A_j$, and $p_j$ denotes the number of UD-safe solutions for the public module $m_j$. We will see below in Theorem 3 that even deciding whether a given subset is UD-safe for a module is coNP-hard in the number of attributes (and that the set of all such subsets can be computed in exponential time). However, as argued in the first step, this computation can be done once as a pre-processing step with its cost amortized over possibly many workflows where the module is used. In addition, it suffices to compute the UD-safe subsets for only those public modules that belong to some public-closure for some private module.

Theorem 3. Given a public module $m_j$ with k attributes, and a subset of hidden attributes H, deciding whether $m_j$ is UD-safe with respect to H is coNP-hard in k. Further, all UD-safe subsets can be found in EXP-time in k.

Sketch of coNP-hardness. The reduction is from the UNSAT problem, where given n variables $x_1, \ldots, x_n$ and a 3CNF formula $f(x_1, \ldots, x_n)$, the goal is to check whether f is not satisfiable. In our construction, $m_j$ has $n + 1$ inputs $x_1, \ldots, x_n$ and y, and the output is $z = m_j(x_1, \ldots, x_n, y) = f(x_1, \ldots, x_n) \vee y$ (OR). The set of hidden attributes H is $\{x_1, \ldots, x_n\}$ (i.e. y and z are visible). We claim that f is not satisfiable if and only if $m_j$ is UD-safe with respect to H. □
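The claim in this sketch can be sanity-checked on small formulas. The code below is only an illustration under stated assumptions: formulas are passed as ordinary Python functions rather than 3CNF encodings, the module relation is built by exhaustive enumeration (so it is exponential in n), and all names are hypothetical. UD-safety is tested through the equivalent condition that two executions agree on visible inputs exactly when they agree on visible outputs.

```python
from itertools import product

def is_ud_safe(relation, in_attrs, out_attrs, hidden):
    """D-safe and U-safe together: inputs are equivalent iff outputs are equivalent."""
    def vis(row, attrs):
        return tuple(v for a, v in zip(attrs, row) if a not in hidden)
    for (x, y), (x2, y2) in product(relation.items(), repeat=2):
        same_in = vis(x, in_attrs) == vis(x2, in_attrs)
        same_out = vis(y, out_attrs) == vis(y2, out_attrs)
        if same_in != same_out:
            return False
    return True

def reduction_module(f, n):
    """Module with inputs x1..xn, y and output z = f(x1..xn) OR y; x1..xn are hidden."""
    in_attrs = [f"x{i}" for i in range(1, n + 1)] + ["y"]
    relation = {}
    for bits in product((0, 1), repeat=n + 1):
        *xs, y = bits
        relation[bits] = (int(bool(f(*xs))) | y,)
    return relation, in_attrs, ["z"], set(in_attrs[:-1])

def unsatisfiable(x1, x2):
    return (x1 or x2) and (not x1) and (not x2)

def satisfiable(x1, x2):
    return x1 or x2

unsat_is_safe = is_ud_safe(*reduction_module(unsatisfiable, 2))
sat_is_safe = is_ud_safe(*reduction_module(satisfiable, 2))
```

When f is unsatisfiable the module reduces to $z = y$, so visible inputs and outputs determine each other; any satisfying assignment breaks this correspondence.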

The same construction as in the coNP-hardness proof, with attributes y and z assigned cost zero and all other attributes assigned some higher constant cost, can be used to show that testing whether a safe subset with cost smaller than a given threshold exists is also coNP-hard.

Regarding the upper bound, the trivial algorithm of going over all $2^k$ subsets h of $A_j$, and checking if h is UD-safe for $m_j$, can be done in EXP-time in k when the domain size is constant. Since the UD-safe property is not monotone with respect to further deletion of attributes, if h is UD-safe, its supersets may not be UD-safe. Recall however that the trivial solution $h = A_j$ (deleting all attributes) is always UD-safe for $m_j$. So for practical purposes, when the public-closure for a private module involves a small number of attributes of the public modules in the closure, or if the attributes of those public modules have small cost, this solution can be used. The complete proof of the theorem is given in Appendix B.1.

III. Optimal $H_i$ for Each Private Module. The third step aims to find a set $H_i$ of hidden attributes, of minimum cost, for every private module $m_i$. As per the statement of Theorem 2, this set $H_i$ should satisfy the conditions: (a) $H_i \cap O_i = S_{i\ell}$, for some safe subset $S_{i\ell} \in \mathbf{S}_i$; (b) for every public module $m_j$ in the closure $C(S_{i\ell})$, there exists a UD-safe subset $U_{jq} \in \mathbf{U}_j$ such that $U_{jq} = A_j \cap H_i$; and (c) $H_i$ does not include any attribute outside $O_i$ and $C(S_{i\ell})$.

We show that, for the important class of chain and tree workflows, this optimization problem is solvable in time polynomial in the number of modules n, the total number of attributes in the workflow $|A|$, and the maximum number of sets in $\mathbf{S}_i$ and $\mathbf{U}_j$ (denoted by $L = \max_{i \in [1,n]} p_i$):

Theorem 4. For each private module $m_i$ in a tree workflow (and therefore, in a chain workflow), the optimal subset $H_i$ can be found in polynomial time in n, $|A|$ and L.

In contrast, the problem becomes NP-hard in n when the public-closure forms an arbitrary directed acyclic subgraph, even when L (the number of safe and UD-safe subsets of the individual modules) is a constant and the number of attributes of the individual modules is bounded by a small constant.

Chain workflows are the simplest class of tree-shaped workflow, hence clearly any algorithm for trees 
will also work for chains. However, for the sake of simplicity, we give the optimal algorithm for chain 
workflows first; then we discuss how it can be proved for tree workflows. 

Optimal algorithm for chain workflows. Consider any private module $m_i$. Given a safe subset $S_{i\ell} \in \mathbf{S}_i$, we show below how an optimal subset $H_i$ in $C(S_{i\ell})$ satisfying the desired properties can be obtained. We then repeat this process for all safe subsets $S_{i\ell} \in \mathbf{S}_i$ (their number is bounded by L), and output the subset $H_i$ with minimum cost. We drop the subscripts to simplify the notation (i.e. use S for $S_{i\ell}$, C for $C(S_{i\ell})$, and H for $H_i$).

Our poly-time algorithm employs dynamic programming to find the optimal $H$. First note that since $C$ is the public-closure of output attributes for a chain workflow, $C$ should be a chain itself. Let the modules in $C$ be renumbered as $m_1, \ldots, m_k$ in order. Now we solve the problem by dynamic programming as follows. Let $Q$ be a $k \times L$ two-dimensional array, where $Q[j,\ell]$ denotes the cost of the minimum-cost hidden subset $H^{j\ell}$ that satisfies the UD-safe condition for all public modules $m_1$ to $m_j$ and $A_j \cap H^{j\ell} = U_{j,\ell} \in U_j$. Here $j \le k$, $\ell \le p_j \le L$, and $A_j$ is the attribute set of $m_j$; the actual solution can be stored easily by a standard argument.

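The initialization step is, for $1 \le \ell \le p_1$,

$$Q[1,\ell] = \begin{cases} c(U_{1,\ell}) & \text{if } U_{1,\ell} \supseteq S \\ \infty & \text{otherwise.} \end{cases}$$

Recall that for a chain, $O_{j-1} = I_j$, for $j = 2$ to $k$. Then for $j = 2$ to $k$, $\ell = 1$ to $p_j$,

$$Q[j,\ell] = \begin{cases} \infty & \text{if there is no } 1 \le q \le p_{j-1} \text{ such that } U_{j-1,q} \cap O_{j-1} = U_{j,\ell} \cap I_j \\ c(O_j \cap U_{j,\ell}) + \min_q Q[j-1,q] & \text{otherwise,} \end{cases}$$

where the minimum is taken over all such $q$.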

It is interesting to note that such a $q$ always exists for at least one $\ell \le p_j$: when defining UD-safe subsets, we discussed that any public module $m_j$ is UD-safe when its entire attribute set $A_j$ is hidden. Hence $A_{j-1} \in U_{j-1}$ and $A_j \in U_j$, which will make the equality check true (for a chain $O_{j-1} = I_j$). In Appendix B.2 we show that $Q[j,\ell]$ correctly stores the desired value. Then the optimal solution $H$ has cost $\min_{1 \le \ell \le p_k} Q[k,\ell]$; the corresponding solution $H$ can be found by a standard procedure, which proves Theorem 4 for chain workflows.
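
The dynamic program for chains can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the candidate UD-safe subsets of each public module are assumed to be precomputed and passed in, costs are assumed additive over attributes, and the function name, the dictionary layout and the sample data are hypothetical, not taken from the paper.

```python
import math

def optimal_chain_hidden_subset(S, modules, cost):
    """Return (min_cost, H) for a chain-shaped public closure.

    S       -- frozenset of hidden output attributes of the private module
    modules -- list m_1..m_k; each a dict with frozensets "I", "O" and a list
               "U" of candidate UD-safe subsets (assumed precomputed)
    cost    -- dict mapping each attribute to a non-negative hiding cost
    """
    def c(attrs):
        return sum(cost[a] for a in attrs)

    k = len(modules)
    Q = [[math.inf] * len(m["U"]) for m in modules]
    back = [[None] * len(m["U"]) for m in modules]

    # Initialization: the hidden set chosen for the first module must contain S.
    for l, U in enumerate(modules[0]["U"]):
        if U >= S:
            Q[0][l] = c(U)

    # Recurrence: consecutive modules must agree on the shared attributes,
    # since O_{j-1} = I_j in a chain; only the new output part is charged.
    for j in range(1, k):
        prev, cur = modules[j - 1], modules[j]
        for l, U in enumerate(cur["U"]):
            for q, Uq in enumerate(prev["U"]):
                if (Uq & prev["O"]) != (U & cur["I"]):
                    continue
                cand = c(U & cur["O"]) + Q[j - 1][q]
                if cand < Q[j][l]:
                    Q[j][l], back[j][l] = cand, q

    l = min(range(len(Q[-1])), key=lambda idx: Q[-1][idx])
    if math.isinf(Q[-1][l]):
        return math.inf, None
    best = Q[-1][l]
    H = set()
    for j in range(k - 1, -1, -1):  # follow back pointers to rebuild H
        H |= modules[j]["U"][l]
        l = back[j][l]
    return best, frozenset(H)


# Hypothetical two-module chain: m_1 maps a -> b, m_2 maps b -> c.
chain = [
    {"I": frozenset({"a"}), "O": frozenset({"b"}),
     "U": [frozenset({"a", "b"}), frozenset({"a"})]},
    {"I": frozenset({"b"}), "O": frozenset({"c"}),
     "U": [frozenset({"b", "c"}), frozenset()]},
]
costs = {"a": 1, "b": 2, "c": 5}
result = optimal_chain_hidden_subset(frozenset({"a"}), chain, costs)
```

In the full algorithm this routine would be called once per safe subset $S_{i\ell}$ of the private module, keeping the cheapest result.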

Observe that, more generally, the algorithm may also be used for non-chain workflows, if the public-closures of the safe subsets for private modules have chain shape. This observation also applies to the following discussion on tree workflows.

Optimal algorithm for tree workflows. Now consider tree-shaped workflows, where every module in the workflow has at most one immediate predecessor (for all modules $m_i$, if $I_i \cap O_j \neq \emptyset$ and $I_i \cap O_k \neq \emptyset$, then $j = k$), but a module can have one or more immediate successors.

The treatment of tree-shaped workflows is similar to what we have seen for chains. Observe that, here again, since $C$ is the public-closure of output attributes for a tree-shaped workflow, $C$ will be a collection of trees all rooted at $m_i$. As for the case of chains, the processing of the public closure is based on dynamic programming. The key difference is that the modules in the tree are processed bottom up (rather than top down as in what we have seen above) to handle branching. The proof of Theorem 4 for tree workflows is given in Appendix B.3.

NP-hardness for public-closure of arbitrary shape. Finding the minimal-cost solution for a public-closure with arbitrary DAG shape is NP-hard. We give a reduction from 3SAT (see Appendix B.4). The NP algorithm simply guesses a set of attributes and checks whether it forms a legal solution and has cost lower than the given bound; a corresponding EXP-time algorithm that iterates over all subsets can be used to find the optimal solution.
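
The exhaustive EXP-time search mentioned above can be sketched as follows. This is a minimal illustration, assuming additive attribute costs and a caller-supplied predicate `is_legal` (a hypothetical name) that decides whether a hidden set satisfies the safety conditions; the sample predicate is invented purely to exercise the code.

```python
from itertools import combinations

def cheapest_legal_subset(attributes, cost, is_legal):
    """Try every subset of `attributes` and keep the cheapest legal one.

    Runs in time exponential in len(attributes); `is_legal` is a
    caller-supplied check standing in for the safety conditions.
    """
    attrs = sorted(attributes)
    best_cost, best = None, None
    for r in range(len(attrs) + 1):
        for combo in combinations(attrs, r):
            H = frozenset(combo)
            if not is_legal(H):
                continue
            total = sum(cost[a] for a in H)
            if best_cost is None or total < best_cost:
                best_cost, best = total, H
    return best_cost, best


# Hypothetical legality rule: hide "a", or hide both "b" and "c".
toy_cost = {"a": 3, "b": 1, "c": 1}
toy_legal = lambda H: "a" in H or {"b", "c"} <= H
toy_result = cheapest_legal_subset(toy_cost.keys(), toy_cost, toy_legal)
```

Restricting `attributes` to those of the public closure, rather than the full workflow, is what keeps this search feasible when the closure is small.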

The NP-completeness here is in $n$, the number of modules in the public closure. We note, however, that in practice the number of public modules that process the output of an individual private module is small. So the obtained solution to the optimum-view problem is still better than the naive one, which is exponential in the size of the full workflow.

IV. Optimal Hidden Subset $H$ for the Workflow. According to Theorem 2, $H = \bigcup_{i : m_i \text{ is private}} H_i$ is a F-private solution for the workflow. Observe that finding the optimal (minimum cost) such solution $H$ for single-predecessor workflows is straightforward, once the minimum cost $H_i$s are found: due to the condition in Theorem 2 that no unnecessary data are hidden, it can be easily checked that for any two private modules $m_i, m_k$ in a single-predecessor workflow, $H_i \cap H_k = \emptyset$. This implies that the optimal solution $H$ can be obtained by taking the union of the optimal hidden subsets $H_i$ for individual private modules obtained in the previous step.






5 General Workflows 



The previous sections focused on single-predecessor workflows. In particular, we presented a privacy theorem for such workflows and studied optimization with respect to this theorem. The following two observations highlight how this privacy theorem can be extended to general workflows. For lack of space the discussion is informal; the proof techniques are similar to those for single-predecessor workflows and are given in the appendix.

Observation 1: Need for propagation through private modules. All examples in the previous sections that showed the necessity of the single-predecessor assumption for a private module $m_i$ had another private module $m_k$ which is a successor of one public module in the public closure of $m_i$. For instance, in the proof of Proposition 1 (see the accompanying figure), $m_i = m_1$ and $m_k = m_4$. If we had continued hiding output attributes of $m_4$, we could obtain the required possible worlds leading to a non-trivial privacy guarantee F > 1. This implies that for general workflows, the propagation of attribute hiding should continue outside the public closure and through the descendant private modules.

Observation 2: D-safety suffices (instead of UD-safety). The proof of Lemma 1 shows that the UD-safety property of modules in the public-closure is needed only when some public module in the public-closure has a private successor whose output attributes are visible. If all modules in the public closure have no such private successor, then a downstream-safety property (called the D-safety property) is sufficient. More generally, if attribute hiding is propagated through private modules (as discussed above), then it suffices to require the hidden attributes to satisfy the D-safety property rather than the stronger UD-safety property.

The intuition from the above two observations is formalized in a privacy theorem for general workflows, analogous to Theorem 2. First, instead of public-closure, it uses downward-closure: for a private module $m_i$ and a set of hidden attributes $h_i$, the downward-closure $D(h_i)$ consists of all modules (public or private) $m_j$ that are reachable from $m_i$ by a directed path. Second, instead of requiring the sets $H_i$ of hidden attributes to ensure UD-safety, it requires them to only ensure D-safety.
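
As a rough illustration of the downward-closure, the reachable modules can be collected with a breadth-first search. The sketch below makes simplifying assumptions: a module $m_b$ is treated as a successor of $m_a$ whenever some output attribute of $m_a$ is an input of $m_b$, reachability is computed from the private module itself regardless of which attributes are hidden, and the function name and the sample workflow are hypothetical.

```python
from collections import deque

def downward_closure(start, inputs, outputs):
    """Modules reachable from `start` by a directed path (start excluded).

    inputs / outputs -- dicts mapping each module name to its set of
    input / output attribute names.
    """
    def successors(m):
        return [n for n in inputs if n != m and outputs[m] & inputs[n]]

    seen, queue = set(), deque([start])
    while queue:
        m = queue.popleft()
        for n in successors(m):
            if n != start and n not in seen:
                seen.add(n)
                queue.append(n)
    return seen


# Hypothetical four-module workflow with attributes a1..a7.
wf_inputs = {"m1": {"a1", "a2"}, "m2": {"a3", "a4"}, "m3": {"a5"}, "m4": {"a3"}}
wf_outputs = {"m1": {"a3", "a4"}, "m2": {"a5"}, "m3": {"a6"}, "m4": {"a7"}}
closure_m1 = downward_closure("m1", wf_inputs, wf_outputs)
```

Unlike the public-closure, the set returned here does not stop at private modules, which is why it can contain more modules.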

The proof of the revised theorem is similar to that of Theorem 2, with the added complication that the $H_i$ subsets are no longer guaranteed to be disjoint. This is resolved by proving that D-safe subsets are closed under union, allowing for the (possibly overlapping) $H_i$ subsets computed for the individual private modules to be unioned.

The hardness results from the previous section transfer to the case of general workflows. Since the $H_i$s in this case may be overlapping, the union of optimal $H_i$ solutions for individual modules $m_i$ may not give the optimal solution for the workflow. Whether or not there exists a non-trivial approximation is an interesting open problem.

To conclude the discussion, note that for single-predecessor workflows, we now have two options to ensure workflow-privacy: (i) to consider public-closures and ensure UD-safety properties for their modules (following the privacy theorem for single-predecessor workflows); or (ii) to consider downward-closures and ensure D-safety properties for their modules (following the privacy theorem for general workflows). Observe that these two options are incomparable: satisfying UD-safety properties may require hiding more attributes than what is needed for satisfying D-safety properties. On the other hand, the downward-closure includes more modules than the public-closure (for instance the reachable private modules), and additional attributes must be hidden to satisfy their D-safety properties. One could therefore run both algorithms, and choose the lower cost solution.

6 Related Work 

Privacy concerns with respect to provenance were articulated in [16] in the context of scientific workflows, and in ifTTl in the context of business processes. Preserving module privacy in all-private workflows was studied in f\S], and the idea of privatizing (hiding the "name" of) public modules to achieve privacy in public/private workflows was proposed. Unfortunately this is not realistic for many common scenarios. This paper thus presents a novel propagation model for attribute hiding which does not place any assumptions on the user's prior knowledge about public modules.

Recent work by other authors includes the development of fine-grained access control languages for provenance [30] [32] [7] [H, and a graph grammar approach for rewriting redaction policies over provenance [9]. The approach in O provides users with informative graph query results using surrogates, which give less sensitive versions of nodes/edges, and proposes a utility measure for the result. A framework to output a partial view of a workflow that conforms to a given set of access permissions on the connections and input/output ports was proposed in [10]. Although related to module privacy, the approach may disconnect connections between modules rather than just hiding the data which flows between them; furthermore, it may hide more provenance information than our mechanism. More importantly, the notion of privacy is informal and no guarantees on the quality of the solutions are provided.

A related area is that of privacy-preserving data mining (see surveys [4] [33], and the references therein). Here, the goal is to hide individual data attributes while retaining the suitability of the data for mining patterns. Privacy preserving approaches have been studied for social networks (e.g. [5]), auditing queries (e.g. [29]), network routing [24], and several other contexts.

Our notion of module privacy is closest to the notion of $\ell$-diversity considered in fTT\, which addresses some shortcomings of $k$-anonymity [31]. The notion of $\ell$-diversity tries to generalize the values of the non-sensitive attributes so that for every such generalization, there are at least $\ell$ different values of sensitive attributes. The view-based approach for $k$-anonymity along with its complexity has been studied in [37]. Leakage of information due to knowledge on the techniques for minimizing data loss has been studied in [34] [22] [14] [35]; however, our privacy guarantees are information theoretic under our assumptions.

Nevertheless, the privacy notion of $\ell$-diversity is susceptible to attack when the user has background knowledge [23] [25]. Differential privacy [20] [18] [19], which requires that the output distribution is almost invariant to the inclusion of any particular record, gives a stronger privacy guarantee. Although it was first proposed for statistical databases and aggregate queries, it has since been studied in domains such as mechanism design [28], data streaming |[2n, and several database-related applications (e.g. [26] [36] [13] [HI). However, it is well-known that no deterministic algorithm can guarantee differential privacy, and the standard approach of including random noise is not suitable for our purposes: provenance queries are typically not aggregate queries, and we need the output views to be consistent (e.g. the same module must map the same input to the same output in all executions of the workflow). Defining an appropriate notion of differential privacy for module functionality with respect to provenance queries is an interesting open problem. It would also be interesting to study natural attacks for our application, and (theoretically or empirically) study the effectiveness of various notions of privacy under these attacks (see e.g. [12]).

7 Conclusion



In this paper, we addressed the problem of preserving module privacy in public/private workflows (called 
workflow-privacy), by providing a view of provenance information in which the input to output mapping 
of private modules remains hidden. As several examples in this paper show, the workflow-privacy of a 
module critically depends on the structure (connection patterns) of the workflow, the behavior/functionality 
of other modules in the workflow, and the selection of hidden attributes. We showed how workflow-privacy 
can be achieved by propagating data hiding through public modules in both single-predecessor and general 
workflows. 

Several interesting future research directions related to the application of differential privacy were discussed in Section 6. We made certain assumptions in the paper (constant domain size, acyclic nature of workflows, analysis using relations of executions, etc.). Even with these assumptions, the problem is highly non-trivial, and large and important classes of workflows can be captured under them. However, it would be immensely important to have models and solutions that can be used in scientific experiments in practice. We have also mentioned the shortcomings of F-privacy and the difficulty in using stronger privacy notions like differential privacy in the previous section. It will be interesting to see if the possible world model thoroughly studied in this paper can be used to facilitate the use of other privacy models under provenance queries.

Table 4: Relation $R_a$ for workflow $W_a$ given in Figure 6a

A Proofs from Section 4
A.1 Proof of Proposition 1

By Definition 9, a workflow $W$ is not a single-predecessor workflow if one of the following holds: (i) there is a public module $m_j$ in $W$ that belongs to the public-closure of a private module $m_i$ but has no directed path from $m_i$, or (ii) such a public module $m_j$ has directed paths from more than one private module, or (iii) $W$ has data sharing.

To prove the proposition we provide three example workflows where exactly one of the violating conditions (i), (ii), (iii) holds, and Theorem 2 does not hold in those workflows. Case (i) was shown in Section 4.1.1. To complete the proof we demonstrate here cases (ii) and (iii).

Multiple private predecessors. We give an example where Theorem 2 does not hold when a public module belonging to a public-closure has more than one private predecessor.

Example 8. Consider the workflow Wa in Figure^^ which is a modification ofWa by the addition of private 
module mo, that takes ao as input and produces a2 = mo(ao) = ao as output. The public module m^ is in 
public-closure ofm\, but has directed public paths from both mo and m\. The relation Rafor Wa in given in 
Table^where the hidden attributes {a2,as,a4,a5} are colored in grey. 

Now we have exactly the same problem as before: When m\ maps to 1, a^ = \ irrespective of the value 
of a/\. In the first row ag = 0, whereas in the second row = 1. However, whatever the new definitions of 
mo are for mo and m/[for m4, m^ cannot map 1 to both and 1. Hence F = 1. □ 

Data sharing. Now we give an example where Theorem 2 does not hold when the workflow has data sharing.

Example 9. Consider the workflow, say $W_b$, given in Figure 6b. All attributes take values in {0, 1}. The initial inputs are $a_1, a_2$, and the final outputs are $a_6, a_7$; only $m_2$ is public. The functionality of the modules is as follows: (i) $m_1$ takes $a_1, a_2$ as input and produces $m_1(a_1, a_2) = (a_3 = a_1, a_4 = a_2)$. (ii) $m_2$ takes $a_3, a_4$ as input and produces $a_5 = m_2(a_3, a_4) = a_3 \vee a_4$ (OR). (iii) $m_3$ takes $a_5$ as input and produces $a_6 = m_3(a_5) = a_5$. (iv) $m_4$ takes $a_3$ as input and produces $a_7 = m_4(a_3) = a_3$. Note that data $a_3$ is input to both $m_2$ and $m_4$, hence the workflow has data sharing.

Now focus on private module $m_1 = m_i$. Clearly hiding output $a_3$ of $m_1$ gives 2-standalone privacy, and for hidden attribute $h_i = \{a_3\}$, the public-closure $C(h_i) = \{m_2\}$. As given in the theorem, $H_i \subseteq O_i \cup \bigcup_{\ell : m_\ell \in C(h_i)} A_\ell = \{a_3, a_4, a_5\}$ in this case.

We claim that hiding even all of $\{a_3, a_4, a_5\}$ gives only trivial 1-workflow-privacy of $m_1$, although the UD-safe condition is satisfied for $m_2$ (actually hiding $a_3, a_4$ gives 4-standalone-privacy for $m_1$). Table 5 gives the relation $R_b$, where the hidden attribute values are in grey.

Figure 6: Proof of Proposition 1: (a) Multiple private predecessors, (b) Data sharing. White modules are public, grey are private.

Table 5: Relation $R_b$ for workflow $W_b$ given in Figure 6b



When $a_3$ (and also $a_4$) is hidden, a possible candidate output of input tuple $x = (0,0)$ to $m_1$ is $(1,0)$. So we need to have a possible world where $m_1$ is redefined as $\hat{m}_1(0,0) = (1,0)$. Then $a_5$ takes value 1 in the first row, and this is the only row with visible attributes $a_1 = 0, a_2 = 0$. So this requires that $\hat{m}_3(a_5 = 1) = (a_6 = 0)$ and $\hat{m}_4(a_3 = 1) = (a_7 = 0)$, to have the same projection on visible $a_6, a_7$.

The second, third and fourth rows, $r_2, r_3, r_4$, have $a_6 = 1$, so to have the same projection, we need $a_5 = 0$ for these three rows, so we need $\hat{m}_3(a_5 = 0) = (a_6 = 1)$ (since we had to already define $\hat{m}_3(1) = 0$). When $a_5$ is 0, since the public module $m_2$ is an OR function, the only possibility for the values of $a_3, a_4$ in rows $r_2, r_3, r_4$ is $(0,0)$. Now we have a conflict on the value of the visible attribute $a_7$, which is 0 for $r_2$ but 1 for $r_3, r_4$, whereas for all these rows the value of $a_3$ is 0. $\hat{m}_4$, being a function with dependency $a_3 \to a_7$, cannot map $a_3$ to both 0 and 1. Similarly we can check that if $\hat{m}_1(0,0) = (0,1)$ or $\hat{m}_1(0,0) = (1,1)$ (both $a_3, a_4$ are hidden), we will have exactly the same problem. Hence all possible worlds of $R_b$ with these hidden attributes must map $m_1(0,0)$ to $(0,0)$, and therefore F = 1. □
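
The claim of Example 9 can be checked mechanically on this tiny workflow. The sketch below is an illustration only: it assumes that a possible world may redefine the private modules $m_1, m_3, m_4$ arbitrarily while keeping the public OR module $m_2$ fixed, and that the visible attributes are $a_1, a_2, a_6, a_7$. All function and variable names are hypothetical.

```python
from itertools import product

BITS = (0, 1)
PAIRS = list(product(BITS, repeat=2))

def visible_row(m1, m3, m4, a1, a2):
    """Run one execution and return only the visible attributes."""
    a3, a4 = m1[(a1, a2)]
    a5 = a3 | a4            # public module m2 (OR) is never redefined
    a6 = m3[a5]
    a7 = m4[a3]             # a3 is shared between m2 and m4
    return (a1, a2, a6, a7)

def visible_relation(m1, m3, m4):
    return {visible_row(m1, m3, m4, a1, a2) for a1, a2 in PAIRS}

# Original definitions: m1 copies its inputs, m3 and m4 are identities.
orig_m1 = {p: p for p in PAIRS}
identity = {0: 0, 1: 1}
target = visible_relation(orig_m1, identity, identity)

def candidate_outputs(x):
    """Outputs of m1 on x over all worlds that agree on the visible data."""
    outs = set()
    for m1_vals in product(PAIRS, repeat=len(PAIRS)):
        m1 = dict(zip(PAIRS, m1_vals))
        for m3_vals in product(BITS, repeat=2):
            m3 = dict(zip(BITS, m3_vals))
            for m4_vals in product(BITS, repeat=2):
                m4 = dict(zip(BITS, m4_vals))
                if visible_relation(m1, m3, m4) == target:
                    outs.add(m1[x])
    return outs

outs_00 = candidate_outputs((0, 0))
```

The size of `outs_00` plays the role of the privacy level for input $(0,0)$ in this toy setting; the enumeration covers 4096 candidate worlds.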

A.2 Proof of Lemma 1

The proof of Lemma 1 uses the following lemma. It states that if $y$ is a candidate output of an input $x$ to module $m_i$ with respect to hidden attributes $h_i$ (i.e. $y \in \mathrm{OUT}_{x,m_i,h_i}$), then $y$ and the actual output of $x$, $z = m_i(x)$, must be equivalent.

Lemma 3. Let $m_i$ be a standalone private module with relation $R_i$, let $x$ be an input to $m_i$, and let $h_i \subseteq O_i$ be a subset of hidden attributes. If $y \in \mathrm{OUT}_{x,m_i,h_i}$ then $y \equiv_{h_i} z$ where $z = m_i(x)$.

Note that, in Example El, $y = (\underline{1}, 0)$ and $z = (\underline{0}, 0)$ are equivalent on the visible attributes as Lemma 3 says (hidden attributes are underlined).

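Proof. A subset of output attributes of $m_i$, $h_i \subseteq O_i$, is hidden. Recall that $A_i = I_i \cup O_i$ denotes the set of attributes of $m_i$ and let $R_i$ be the standalone relation for $m_i$. If $y \in \mathrm{OUT}_{x,m_i,h_i}$, then from Definition 2,

$$\exists R' \in \mathrm{Worlds}(R_i, h_i),\ \exists t' \in R' \ \text{s.t.}\ x = \Pi_{I_i}(t') \wedge y = \Pi_{O_i}(t') \qquad (1)$$

Further, from Definition 1, $R' \in \mathrm{Worlds}(R_i, h_i)$ only if $\Pi_{A_i \setminus h_i}(R_i) = \Pi_{A_i \setminus h_i}(R')$. Hence there must exist a tuple $t \in R_i$ such that

$$\Pi_{A_i \setminus h_i}(t) = \Pi_{A_i \setminus h_i}(t') \qquad (2)$$

Since $h_i \subseteq O_i$, $I_i \subseteq A_i \setminus h_i$. From (2), $\Pi_{I_i}(t) = \Pi_{I_i}(t') = x$. Let $z = \Pi_{O_i}(t)$, i.e. $z = m_i(x)$. From (2), $\Pi_{O_i \setminus h_i}(t) = \Pi_{O_i \setminus h_i}(t')$, then $\Pi_{O_i \setminus h_i}(z) = \Pi_{O_i \setminus h_i}(y)$. Tuples $y$ and $z$ are defined on $O_i$. Hence from Definition 2, $y \equiv_{h_i} z$. □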

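Corollary 1. For a module $m_i$, and hidden attributes $h_i \subseteq O_i$, if two tuples $y, z$ defined on $O_i$ are such that $y \equiv_{h_i} z$, then also $y \equiv_{H_i} z$ where $H_i \supseteq h_i$ is a set of hidden attributes in the workflow.

Proof. Since $y \equiv_{h_i} z$, $\Pi_{O_i \setminus h_i}(y) = \Pi_{O_i \setminus h_i}(z)$. Since $h_i \subseteq H_i$, $O_i \setminus H_i \subseteq O_i \setminus h_i$. Therefore, $\Pi_{O_i \setminus H_i}(y) = \Pi_{O_i \setminus H_i}(z)$, i.e. $y \equiv_{H_i} z$. □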

Note. Lemma 3 does not use any property of single-predecessor workflows and also works for general workflows. This lemma will be used again for the privacy theorem of general workflows (Theorem 5).



Definition of FLIP and EFLIP (extended FLIP) functions. To prove Lemma 1 we need to show the existence of a possible world satisfying the criteria. This possible world will be obtained by joining alternative definitions of private modules, and the original definitions of public modules. We will need the following flipping functions to formally present how we derive the alternative module definitions from the original modules. These functions examine parts of inputs, and possibly change parts of the original outputs.

Definition 10. Given subsets of attributes $P, Q \subseteq A$, two tuples $p, q$ defined on $P$, and a tuple $u$ defined on $Q$, $\mathrm{FLIP}_{p,q}(u) = w$ is a tuple defined on $Q$ constructed as follows:

• if $\Pi_{Q \cap P}(u) = \Pi_{Q \cap P}(p)$, then $w$ is such that $\Pi_{Q \cap P}(w) = \Pi_{Q \cap P}(q)$ and $\Pi_{Q \setminus P}(w) = \Pi_{Q \setminus P}(u)$,

• else if $\Pi_{Q \cap P}(u) = \Pi_{Q \cap P}(q)$, then $w$ is such that $\Pi_{Q \cap P}(w) = \Pi_{Q \cap P}(p)$ and $\Pi_{Q \setminus P}(w) = \Pi_{Q \setminus P}(u)$,

• otherwise, $w = u$.

The following observations capture the properties of the FLIP function.

Observation 1.

1. If $\mathrm{FLIP}_{p,q}(u) = w$, then $\mathrm{FLIP}_{p,q}(w) = u$.

2. $\mathrm{FLIP}_{p,q}(\mathrm{FLIP}_{p,q}(u)) = u$.

3. If $P \cap Q = \emptyset$, $\mathrm{FLIP}_{p,q}(u) = u$.

4. $\mathrm{FLIP}_{p,q}(p) = q$, $\mathrm{FLIP}_{p,q}(q) = p$.

5. If $\Pi_{Q \cap P}(p) = \Pi_{Q \cap P}(q)$, then $\mathrm{FLIP}_{p,q}(u) = u$.

6. If $Q = Q_1 \cup Q_2$, where $Q_1 \cap Q_2 = \emptyset$, and if $\mathrm{FLIP}_{p,q}(\Pi_{Q_1}(u)) = w_1$ and $\mathrm{FLIP}_{p,q}(\Pi_{Q_2}(u)) = w_2$, then $\mathrm{FLIP}_{p,q}(u) = w$ such that $\Pi_{Q_1}(w) = w_1$ and $\Pi_{Q_2}(w) = w_2$.
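
To make the flipping operation of Definition 10 concrete, the following sketch represents a tuple as a Python dictionary from attribute names to values, so that the attribute set of a tuple is simply its key set. This representation and the names `project` and `flip` are assumptions made for illustration.

```python
def project(t, attrs):
    """Projection of tuple t (a dict) onto the attributes in attrs."""
    return {a: t[a] for a in attrs if a in t}

def flip(p, q, u):
    """FLIP_{p,q}(u): p and q are defined on P, u is defined on Q."""
    common = set(p) & set(u)                  # Q intersect P
    w = dict(u)                               # values on Q \ P are kept
    if project(u, common) == project(p, common):
        w.update(project(q, common))          # u looks like p: switch to q
    elif project(u, common) == project(q, common):
        w.update(project(p, common))          # u looks like q: switch to p
    return w                                  # otherwise unchanged


# Sample tuples over hypothetical attributes a, b, c.
p = {"a": 1, "b": 0}
q = {"a": 0, "b": 0}
u = {"a": 1, "c": 1}
w = flip(p, q, u)
```

With these sample tuples, `u` agrees with `p` on the shared attribute `a`, so `w` takes the value of `q` on `a` and keeps its own value on `c`; applying `flip` again restores `u`, in line with item 2 of Observation 1.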

The above definition of flipping will be useful when we consider the scenario where $M$ does not have any successor. When $M$ has successors, we need an extended definition of tuple flipping based on other tuples, denoted by EFLIP, as defined below.

Definition 11. Given subsets of attributes $P, Q, R \subseteq A$, two tuples $p, q$ defined on $P \cup R$, a tuple $u$ defined on $Q$ and a tuple $v$ defined on $R$, $\mathrm{EFLIP}_{p,q;v}(u) = w$ is a tuple defined on $Q$ constructed as follows:

• if $v = \Pi_R(p)$, then $w$ is such that $\Pi_{Q \cap P}(w) = \Pi_{Q \cap P}(q)$ and $\Pi_{Q \setminus P}(w) = \Pi_{Q \setminus P}(u)$,

• else if $v = \Pi_R(q)$, then $w$ is such that $\Pi_{Q \cap P}(w) = \Pi_{Q \cap P}(p)$ and $\Pi_{Q \setminus P}(w) = \Pi_{Q \setminus P}(u)$,

• otherwise, $w = u$.

Note that $\mathrm{EFLIP}_{p,q;\Pi_{P \cap Q}(u)}(u) = \mathrm{FLIP}_{p,q}(u)$, where $R = P \cap Q$.
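
The extended flip of Definition 11 can be sketched in the same style. Tuples are again dictionaries, the attribute set $P$ is passed explicitly because $p$ and $q$ are defined on $P \cup R$, and $R$ is taken to be the key set of $v$. The function names and sample values are hypothetical; `flip_reference` is included only to check the remark that EFLIP reduces to FLIP when $R = P \cap Q$.

```python
def project(t, attrs):
    """Projection of tuple t (a dict) onto the attributes in attrs."""
    return {a: t[a] for a in attrs if a in t}

def eflip(p, q, v, u, P):
    """EFLIP_{p,q;v}(u): p, q on P union R, v on R, u on Q."""
    R = set(v)
    common = set(u) & set(P)                  # Q intersect P
    w = dict(u)                               # values on Q \ P are kept
    if v == project(p, R):
        w.update(project(q, common))          # trigger matches p: write q
    elif v == project(q, R):
        w.update(project(p, common))          # trigger matches q: write p
    return w

def flip_reference(p, q, u):
    """Plain FLIP, used here only for comparison."""
    common = set(p) & set(u)
    w = dict(u)
    if project(u, common) == project(p, common):
        w.update(project(q, common))
    elif project(u, common) == project(q, common):
        w.update(project(p, common))
    return w


# Sample setting: P = {a3, a4}, R = {a5}; Y and Z are defined on P union R.
P = {"a3", "a4"}
Y = {"a3": 1, "a4": 0, "a5": 1}
Z = {"a3": 0, "a4": 0, "a5": 0}
flipped = eflip(Y, Z, {"a5": 0}, {"a3": 0, "a4": 0}, P)
```

Here the trigger tuple `{"a5": 0}` equals the projection of `Z` on $R$, so the second case applies and `flipped` takes the values of `Y` on $P$.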

Observation 2. 1. If $\mathrm{EFLIP}_{p,q;v}(u) = w$, and $u'$ is a tuple defined on $Q' \subseteq Q$, then $\mathrm{EFLIP}_{p,q;v}(u') = \Pi_{Q'}(w)$.

Proof of Lemma 1. Now we are ready to prove the lemma. As mentioned in Section 4.1.2, we will assume that there is a single (composite) public module $M$ in the public closure $C(h_i)$ of $m_i$. Recall that $I_i, O_i, A_i$ denote the sets of input, output and all attributes of $m_i$ respectively.

Lemma 1. Consider a standalone private module $m_i$, a set of hidden attributes $h_i$, any input $x$ to $m_i$, and any candidate output $y \in \mathrm{OUT}_{x,m_i,h_i}$ of $x$. Then $y \in \mathrm{OUT}_{x,W,H_i}$ when $m_i$ belongs to a single-predecessor workflow $W$, and a set of attributes $H_i \subseteq A$ is hidden such that (i) $h_i \subseteq H_i$, (ii) only output attributes from $O_i$ are included in $h_i$ (i.e. $h_i \subseteq O_i$), and (iii) every module $m_j$ in the public-closure $C(h_i)$ is UD-safe with respect to $H_i$.

Proof. We fix a module $m_i$, an input $x$ to $m_i$, hidden attributes $h_i \subseteq O_i$, and a candidate output $y \in \mathrm{OUT}_{x,m_i,h_i}$ for $x$. We assume that there is a single public module $M$ in the public closure $C(h_i)$. By the properties of single-predecessor workflows, $M$ gets all its inputs from $m_i$ and sends its outputs to zero or more private modules. We denote the inputs and outputs of $M$ by $I \subseteq O_i$ and $O$ respectively. However, $m_i$ can also send (i) its visible outputs to other public modules (these public modules will have $m_i$ as their only predecessor, but will not have any public path in the undirected sense to $M$), and (ii) visible and hidden attributes to other private modules.

From the conditions in the lemma, a set $H_i$ is hidden in the workflow where (i) $h_i \subseteq H_i$, (ii) $h_i \subseteq O_i$, and (iii) $M$ is UD-safe with respect to $H_i$. We will show that $y \in \mathrm{OUT}_{x,W,H_i}$. We prove this by showing the existence of a possible world $R' \in \mathrm{Worlds}(R, H_i)$, such that if $\Pi_{I_i}(t) = x$ for some $t \in R'$, then $\Pi_{O_i}(t) = y$. Since $y \in \mathrm{OUT}_{x,m_i,h_i}$, by Lemma 3, $y \equiv_{h_i} z$ where $z = m_i(x)$. We consider two cases separately based on whether $M$ has no successor or at least one private successor.

Case I. First consider the easier case that $M$ does not have any successor, so all outputs of $M$ belong to the set of final outputs. We redefine the module $m_i$ to $\hat{m}_i$ as follows. For an input $u$ to $m_i$, $\hat{m}_i(u) = \mathrm{FLIP}_{y,z}(m_i(u))$. All public modules are unchanged, $\hat{m}_j = m_j$. All private modules $m_j \neq m_i$ are redefined as follows: on an input $u$ to $m_j$, $\hat{m}_j(u) = m_j(\mathrm{FLIP}_{y,z}(u))$. The required possible world $R'$ is obtained by taking the join of the standalone relations of these $\hat{m}_j$-s, $j \in [n]$.

First note that by the definition of $\hat{m}_i$, $\hat{m}_i(x) = y$ (since $\hat{m}_i(x) = \mathrm{FLIP}_{y,z}(m_i(x)) = \mathrm{FLIP}_{y,z}(z) = y$, from Observation 1(4)). Hence if $\Pi_{I_i}(t) = x$ for some $t \in R'$, then $\Pi_{O_i}(t) = y$.

Next we argue that $R' \in \mathrm{Worlds}(R, H_i)$. Since $R'$ is the join of the standalone relations for the modules $\hat{m}_j$-s, $R'$ maintains all functional dependencies $I_j \to O_j$. Also none of the public modules are changed, hence for any public module $m_j$ and any tuple $t$ in $R'$, $\Pi_{O_j}(t) = m_j(\Pi_{I_j}(t))$. So we only need to show that the projections of $R$ and $R'$ on the visible attributes are the same.

Let us assume, wlog., that the modules are numbered in topologically sorted order. Let $I_0$ be the initial input attributes to the workflow, and let $p$ be a tuple defined on $I_0$. There are two unique tuples $t \in R$ and $t' \in R'$ such that $\Pi_{I_0}(t) = \Pi_{I_0}(t') = p$. Since $M$ does not have any successor, let us assume that $M = m_{n+1}$; also wlog. assume that the public modules in $C$ are not counted in $j = 1$ to $n+1$ by renumbering the modules. Note that any intermediate or final attribute $a \in A \setminus I_0$ belongs to $O_j$, for a unique $j \in [1,n]$ (since for $j \neq \ell$, $O_j \cap O_\ell = \emptyset$). So it suffices to show that $t, t'$ projected on $O_j$ are equivalent with respect to visible attributes for all modules $j$, $j = 1$ to $n+1$.

Let $c_{j,m}, c_{j,\hat{m}}$ be the values of input attributes $I_j$ and $d_{j,m}, d_{j,\hat{m}}$ be the values of output attributes $O_j$ of module $m_j$, in $t \in R$ and $t' \in R'$ respectively on initial input attributes $p$ (i.e. $c_{j,m} = \Pi_{I_j}(t)$, $c_{j,\hat{m}} = \Pi_{I_j}(t')$, $d_{j,m} = \Pi_{O_j}(t)$ and $d_{j,\hat{m}} = \Pi_{O_j}(t')$). We prove by induction on $j = 1$ to $n$ that

$$\forall j,\ 1 \le j \le n:\quad d_{j,\hat{m}} = \mathrm{FLIP}_{y,z}(d_{j,m}) \qquad (3)$$

First we argue that proving (3) shows that the join of $(\hat{m}_i)_{1 \le i \le n}$ is a possible world of $R$ with respect to hidden attributes $H_i$. (A) When $m_j$ is a private module, note that $d_{j,m}$ and $d_{j,\hat{m}} = \mathrm{FLIP}_{y,z}(d_{j,m})$ may differ only on attributes $O_j \cap O_i$. But $y \equiv_{h_i} z$, i.e. these tuples are equivalent on the visible attributes. Hence for all private modules, $t, t'$ are equivalent with respect to $O_j$ (actually for all $j \neq i$, $O_j \cap O_i = \emptyset$, so the outputs are equal and therefore equivalent). (B) When $m_j$ is a public module, $j \neq n+1$, $O_j \cap O_i = \emptyset$, hence the values of $t, t'$ on $O_j$ are the same and therefore equivalent. (C) Finally, consider $M = m_{n+1}$, which is not covered by (3). $M$ gets all its inputs from $m_i$. From (3),

$$d_{i,\hat{m}} = \mathrm{FLIP}_{y,z}(d_{i,m})$$

Since $y, z, d_{i,m}, d_{i,\hat{m}}$ are all defined on attributes $O_i$, and the input to $m_{n+1}$ satisfies $I_{n+1} \subseteq O_i$,

$$c_{n+1,\hat{m}} = \mathrm{FLIP}_{y,z}(c_{n+1,m})$$

Hence $c_{n+1,\hat{m}} \equiv_{H_i} c_{n+1,m}$. Since these two inputs of $m_{n+1}$ are equivalent with respect to $H_i$, by the UD-safe property of $M = m_{n+1}$, its outputs are also equivalent, i.e. $d_{n+1,\hat{m}} \equiv_{H_i} d_{n+1,m}$. Hence the projections of $t, t'$ on $O_{n+1}$ are also equivalent. Combining (A), (B), (C), $t, t'$ are equivalent with respect to $H_i$.

Proof of (3). The base case follows for $j = 1$. If $m_1 \neq m_i$ ($m_1$ can be public or private), then $I_1 \cap O_i = \emptyset$, so for all inputs $u$,

$$\hat{m}_1(u) = m_1(\mathrm{FLIP}_{y,z}(u)) = m_1(u)$$

Since the inputs $c_{1,\hat{m}} = c_{1,m}$ (both are projections of the initial input $p$ on $I_1$), the outputs $d_{1,\hat{m}} = d_{1,m}$. This shows (3). If $m_1 = m_i$, the inputs are the same, and by the definition of $\hat{m}_1$,

$$d_{1,\hat{m}} = \hat{m}_1(c_{1,\hat{m}}) = \mathrm{FLIP}_{y,z}(m_i(c_{1,\hat{m}})) = \mathrm{FLIP}_{y,z}(m_i(c_{1,m})) = \mathrm{FLIP}_{y,z}(d_{1,m})$$

This shows (3).

Suppose the hypothesis holds until $j - 1$, and consider $m_j$. From the induction hypothesis, $c_{j,\hat{m}} = \mathrm{FLIP}_{y,z}(c_{j,m})$, hence $c_{j,m} = \mathrm{FLIP}_{y,z}(c_{j,\hat{m}})$ (see Observation 1(1)).

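(i) If $j = i$, again,

$$d_{i,\hat{m}} = \hat{m}_i(c_{i,\hat{m}}) = \mathrm{FLIP}_{y,z}(m_i(c_{i,\hat{m}})) = \mathrm{FLIP}_{y,z}(m_i(\mathrm{FLIP}_{y,z}(c_{i,m}))) = \mathrm{FLIP}_{y,z}(m_i(c_{i,m})) = \mathrm{FLIP}_{y,z}(d_{i,m})$$

$\mathrm{FLIP}_{y,z}(c_{i,m}) = c_{i,m}$ follows due to the fact that $I_i \cap O_i = \emptyset$: $y, z$ are defined on $O_i$, whereas $c_{i,m}$ is defined on $I_i$ (see Observation 1(3)).

(ii) If $j \neq i$ and $m_j$ is a private module,

$$d_{j,\hat{m}} = \hat{m}_j(c_{j,\hat{m}}) = m_j(\mathrm{FLIP}_{y,z}(c_{j,\hat{m}})) = m_j(c_{j,m}) = d_{j,m} = \mathrm{FLIP}_{y,z}(d_{j,m})$$

$\mathrm{FLIP}_{y,z}(d_{j,m}) = d_{j,m}$ follows due to the fact that $O_j \cap O_i = \emptyset$: $y, z$ are defined on $O_i$, whereas $d_{j,m}$ is defined on $O_j$ (again see Observation 1(3)).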

(iii) If $m_j$ is a public module, $j \le n$, $\hat{m}_j = m_j$. Here

$$d_{j,\hat{m}} = m_j(c_{j,\hat{m}}) = m_j(\mathrm{FLIP}_{y,z}(c_{j,m})) = m_j(c_{j,m}) = d_{j,m} = \mathrm{FLIP}_{y,z}(d_{j,m})$$

$\mathrm{FLIP}_{y,z}(d_{j,m}) = d_{j,m}$ again follows due to the fact that $O_j \cap O_i = \emptyset$. $\mathrm{FLIP}_{y,z}(c_{j,m}) = c_{j,m}$ follows due to the following reason. If $I_j \cap O_i = \emptyset$, i.e. if $m_j$ does not get any input from $m_i$, again this is true (Observation 1(3)). If $m_j$ gets an input from $m_i$, i.e. $I_j \cap O_i \neq \emptyset$, then since $m_j \neq M$, $I_j \cap O_i$ does not include any hidden attributes from $h_i$. But $y \equiv_{h_i} z$, i.e. the visible attribute values of $y, z$ are the same. In other words, $\Pi_{I_j \cap O_i}(y) = \Pi_{I_j \cap O_i}(z)$, and from Observation 1(5), $\mathrm{FLIP}_{y,z}(c_{j,m}) = c_{j,m}$.

This completes the proof of the lemma for Case I.
Case II. 

Now consider the case when $M$ has one or more private successors (note that $M$ cannot have any public successor by definition). Let $M = m_k$, and assume that the modules $m_1, \ldots, m_n$ are sorted in topological order. Hence $I = I_k$, $O = O_k$, and $I_k \subseteq O_i$. Let $w_y = M(\Pi_{I_k}(y))$, $w_z = M(\Pi_{I_k}(z))$. Instead of $y, z$, the flip function will be with respect to $Y, Z$, where $Y$ is the concatenation of $y$ and $w_y$ ($\Pi_{O_i}(Y) = y$, $\Pi_{O_k}(Y) = w_y$), and $Z$ is the concatenation of $z$ and $w_z$. Hence $Y, Z$ are defined on attributes $O_i \cup O_k$.

We redefine the module $m_i$ to $\hat{m}_i$ as follows. Note that since the input to $M$ satisfies $I_k \subseteq O_i$, $O_i$ is the disjoint union of $I_k$ and $O_i \setminus I_k$. For an input $u$ to $m_i$, $\hat{m}_i(u)$, defined on $O_i$, is such that

$$\Pi_{O_i \setminus I_k}(\hat{m}_i(u)) = \mathrm{FLIP}_{Y,Z}(\Pi_{O_i \setminus I_k}(m_i(u)))$$

and

$$\Pi_{I_k}(\hat{m}_i(u)) = \mathrm{EFLIP}_{Y,Z;\,M(\Pi_{I_k}(m_i(u)))}(\Pi_{I_k}(m_i(u)))$$

For the component with EFLIP, in terms of the notation in Definition 11, $R = O_k$, $P = Q = O_i$, $p = Y$, $q = Z$, defined on $P \cup R = O_i \cup O_k$; $v = M(\Pi_{I_k}(m_i(u)))$, defined on $O_k$; and $u$ in Definition 11 corresponds to $m_i(u)$. All public modules are unchanged, $\hat{m}_j = m_j$. All private modules $m_j \neq m_i$ are redefined as follows: on an input $u$ to $m_j$, $\hat{m}_j(u) = m_j(\mathrm{FLIP}_{Y,Z}(u))$. The required possible world $R'$ is obtained by taking the join of the standalone relations of these $\hat{m}_j$-s, $j \in [n]$.

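First note that by the definition of $\hat{m}_i$, $\hat{m}_i(x) = y$ due to the following reason:

(i) $M(\Pi_{I_k}(m_i(x))) = M(\Pi_{I_k}(z)) = w_z = \Pi_{O_k}(Z)$, so

$$\Pi_{I_k}(\hat{m}_i(x)) = \mathrm{EFLIP}_{Y,Z;\,M(\Pi_{I_k}(m_i(x)))}(\Pi_{I_k}(m_i(x))) = \mathrm{EFLIP}_{Y,Z;\,M(\Pi_{I_k}(z))}(\Pi_{I_k}(z)) = \Pi_{I_k}(y)$$

(ii)

$$\Pi_{O_i \setminus I_k}(\hat{m}_i(x)) = \mathrm{FLIP}_{Y,Z}(\Pi_{O_i \setminus I_k}(m_i(x))) = \mathrm{FLIP}_{Y,Z}(\Pi_{O_i \setminus I_k}(z)) = \Pi_{O_i \setminus I_k}(y)$$

Taking the union of (i) and (ii), $\hat{m}_i(x) = y$. Hence if $\Pi_{I_i}(t) = x$ for some $t \in R'$, then $\Pi_{O_i}(t) = y$.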

Again, next we argue that R' € Worlds(/?,//,), and it suffices to show that the projection of R and R' on 
the visible attributes are the same. 

Let /o be the initial input attributes to the workflow, and let p be a tuple defined on Iq. There are two 
unique tuples t £ R and t' G R' such that 11/, (t) = IT/, (t') = p. Note that any intermediate or final attribute 
c? G A \ /o belongs to Oj, for a unique 7 G [1,«] (since for j ^ I, Oj r\Ot = <p). So it suffices to show that t,t' 
projected on Oj are equivalent with respect to visible attributes for all module 7, j = 1 to « + 1 . 
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Let CjmjCjm be the values of input attributes Ij and dy „j,dj be the values of output attributes Oj of 
module my, in t G /? and t' G R' respectively on initial input attributes p (i.e. Cj^m = n/^ (t), Cj^ = n/^ (t'), 
dy_,„ = Yloj{i) and dy ,7i = Yloj{i!)). We prove by induction on 7 = 1 to « that 



First we argue that proving (4), (5) and (6) shows that the join of $(\hat{m}_i)_{1 \le i \le n}$ is a possible world of $R$ with
respect to hidden attributes $H_i$.

(A) When $m_j$ is a private module, $j \ne i$, note that $d_{j,m}$ and $d_{j,\hat{m}} = \mathrm{FLIP}_{Y,Z}(d_{j,m})$ may differ only on
attributes $(O_k \cup O_i) \cap O_j$. But for $j \ne i$ and $j \ne k$ ($m_j$ is a private module whereas $m_k$ is the composite
public module), $(O_k \cup O_i) \cap O_j = \emptyset$. Hence for all private modules other than $m_i$, $t, t'$ are equal
with respect to $O_j$ and therefore equivalent.

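(B) For $m_i$, from (5),

$\Pi_{I_k}(d_{i,\hat{m}}) = \mathrm{EFLIP}_{Y,Z;M(\Pi_{I_k}(d_{i,m}))}(\Pi_{I_k}(d_{i,m}))$. Here $\Pi_{I_k}(d_{i,m})$ and $\Pi_{I_k}(d_{i,\hat{m}})$ may differ on $I_k$ only if
$M(\Pi_{I_k}(d_{i,m})) \in \{w_y, w_z\}$. By Corollary cor:out-equiv, $y \equiv_{H_i} z$, i.e. $\Pi_{I_k}(y) \equiv_{H_i} \Pi_{I_k}(z)$. But since $M$ is
UDS, by the downstream-safety property, $w_y \equiv_{H_i} w_z$. Then by the upstream-safety property, all inputs
$\Pi_{I_k}(d_{i,m})$ such that $M(\Pi_{I_k}(d_{i,m})) \in \{w_y, w_z\}$ satisfy $\Pi_{I_k}(d_{i,m}) \equiv_{H_i} y \equiv_{H_i} z$. In particular, if $M(\Pi_{I_k}(d_{i,m})) = w_y$, then
$\Pi_{I_k}(d_{i,\hat{m}}) = \Pi_{I_k}(z)$, and $\Pi_{I_k}(z), \Pi_{I_k}(d_{i,m})$ will be equivalent with respect to $H_i$. Similarly, if $M(\Pi_{I_k}(d_{i,m})) = w_z$, then $\Pi_{I_k}(d_{i,\hat{m}}) =
\Pi_{I_k}(y)$, and $\Pi_{I_k}(y), \Pi_{I_k}(d_{i,m})$ will be equivalent with respect to $H_i$. So $t, t'$ are equivalent with respect
to $I_k \setminus H_i$.

Next we argue that $t, t'$ are equivalent with respect to $(O_i \setminus I_k) \setminus H_i$. From (6),

$$\Pi_{O_i \setminus I_k}(d_{i,\hat{m}}) = \mathrm{FLIP}_{Y,Z}(\Pi_{O_i \setminus I_k}(d_{i,m}))$$

$\Pi_{O_i \setminus I_k}(d_{i,m})$ and $\Pi_{O_i \setminus I_k}(d_{i,\hat{m}})$ can differ only if
$\Pi_{O_i \setminus I_k}(d_{i,m}) = \Pi_{O_i \setminus I_k}(y)$, in which case
$\Pi_{O_i \setminus I_k}(d_{i,\hat{m}}) = \Pi_{O_i \setminus I_k}(z)$; or if $\Pi_{O_i \setminus I_k}(d_{i,m}) = \Pi_{O_i \setminus I_k}(z)$, in which case $\Pi_{O_i \setminus I_k}(d_{i,\hat{m}}) = \Pi_{O_i \setminus I_k}(y)$. But $\Pi_{O_i \setminus I_k}(y)$
and $\Pi_{O_i \setminus I_k}(z)$ are equivalent with respect to $H_i$. Hence $\Pi_{O_i \setminus I_k}(d_{i,m})$ and $\Pi_{O_i \setminus I_k}(d_{i,\hat{m}})$ are equivalent with
respect to $H_i$.

Hence $t, t'$ are equivalent on $O_i$.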

(C) When $m_j$ is a public module, $d_{j,\hat{m}} = \mathrm{FLIP}_{Y,Z}(d_{j,m})$. Here $d_{j,m}, d_{j,\hat{m}}$ can differ only on $(O_k \cup O_i) \cap O_j$.
If $j \ne k$, the intersection is empty, and we are done. If $j = k$, $d_{j,m}, d_{j,\hat{m}}$ may differ only if $d_{j,m} \in
\{w_y, w_z\}$. But note that $y \equiv_{H_i} z$, so $\Pi_{I_k}(y) \equiv_{H_i} \Pi_{I_k}(z)$. Since $m_k$ is UDS, for
these two equivalent inputs the respective outputs $w_y, w_z$ are also equivalent. Hence in all cases the
values of $t, t'$ on $O_k$ are equivalent.

Combining (A), (B), (C), the projections of $t, t'$ on $O_j$ are equivalent for all $1 \le j \le n$; i.e. $t, t'$ are equivalent
with respect to $H_i$.



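The claims (4), (5) and (6) used in this proof are:

$$\forall j \ne i,\ 1 \le j \le n:\quad d_{j,\hat{m}} = \mathrm{FLIP}_{Y,Z}(d_{j,m}) \qquad (4)$$

$$\Pi_{I_k}(d_{i,\hat{m}}) = \mathrm{EFLIP}_{Y,Z;M(\Pi_{I_k}(d_{i,m}))}(\Pi_{I_k}(d_{i,m})) \qquad (5)$$

$$\Pi_{O_i \setminus I_k}(d_{i,\hat{m}}) = \mathrm{FLIP}_{Y,Z}(\Pi_{O_i \setminus I_k}(d_{i,m})) \qquad (6)$$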
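Proof of (4), (5) and (6). The base case follows for $j = 1$. If $m_1 \ne m_i$ ($m_1$ can be public or private, but
$k \ne 1$ since $m_i$ is its predecessor), then $I_1 \cap (O_i \cup O_k) = \emptyset$, so for all inputs $u$, $\hat{m}_1(u) = m_1(\mathrm{FLIP}_{Y,Z}(u)) = m_1(u)$
(if $m_1$ is private) and $\hat{m}_1(u) = m_1(u)$ (if $m_1$ is public). Since the inputs $c_{1,\hat{m}} = c_{1,m}$ (both projections of initial
input $p$ on $I_1$), the outputs $d_{1,\hat{m}} = d_{1,m}$. This shows (4). If $m_1 = m_i$, the inputs are the same, and by definition
of $\hat{m}_1$,

$$\begin{aligned}
\Pi_{I_k}(d_{1,\hat{m}}) &= \Pi_{I_k}(\hat{m}_1(c_{1,\hat{m}})) \\
&= \mathrm{EFLIP}_{Y,Z;M(\Pi_{I_k}(m_1(c_{1,\hat{m}})))}(\Pi_{I_k}(m_1(c_{1,\hat{m}}))) \\
&= \mathrm{EFLIP}_{Y,Z;M(\Pi_{I_k}(m_1(c_{1,m})))}(\Pi_{I_k}(m_1(c_{1,m}))) \\
&= \mathrm{EFLIP}_{Y,Z;M(\Pi_{I_k}(d_{1,m}))}(\Pi_{I_k}(d_{1,m}))
\end{aligned}$$

This shows (5) for $i = 1$. Again, by definition of $\hat{m}_1$,

$$\begin{aligned}
\Pi_{O_1 \setminus I_k}(d_{1,\hat{m}}) &= \Pi_{O_1 \setminus I_k}(\hat{m}_1(c_{1,\hat{m}})) \\
&= \mathrm{FLIP}_{Y,Z}(\Pi_{O_1 \setminus I_k}(m_1(c_{1,\hat{m}}))) \\
&= \mathrm{FLIP}_{Y,Z}(\Pi_{O_1 \setminus I_k}(m_1(c_{1,m}))) \\
&= \mathrm{FLIP}_{Y,Z}(\Pi_{O_1 \setminus I_k}(d_{1,m}))
\end{aligned}$$

This shows (6).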

Suppose the hypothesis holds until j — 1, consider mj. From the induction hypothesis, if Ij n O,- = {mj 
does not get input from m,) then Cy ,f, = FLlPY,z(Cj,m), hence Cy ,,, = FLlPY,z(Cj,m) (see Observation [T][T1)). 

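(i) If $j = i$, $I_i \cap O_i = \emptyset$, hence $c_{i,\hat{m}} = \mathrm{FLIP}_{Y,Z}(c_{i,m}) = c_{i,m}$ ($I_i \cap (O_i \cup O_k) = \emptyset$; $m_k$ is a successor of $m_i$, so
$m_i$ cannot be a successor of $m_k$). By definition of $\hat{m}_i$,

$$\begin{aligned}
\Pi_{I_k}(d_{i,\hat{m}}) &= \Pi_{I_k}(\hat{m}_i(c_{i,\hat{m}})) \\
&= \mathrm{EFLIP}_{Y,Z;M(\Pi_{I_k}(m_i(c_{i,\hat{m}})))}(\Pi_{I_k}(m_i(c_{i,\hat{m}}))) \\
&= \mathrm{EFLIP}_{Y,Z;M(\Pi_{I_k}(m_i(c_{i,m})))}(\Pi_{I_k}(m_i(c_{i,m}))) \\
&= \mathrm{EFLIP}_{Y,Z;M(\Pi_{I_k}(d_{i,m}))}(\Pi_{I_k}(d_{i,m}))
\end{aligned}$$

This shows (5).
Again,

$$\begin{aligned}
\Pi_{O_i \setminus I_k}(d_{i,\hat{m}}) &= \mathrm{FLIP}_{Y,Z}(\Pi_{O_i \setminus I_k}(m_i(c_{i,\hat{m}}))) \\
&= \mathrm{FLIP}_{Y,Z}(\Pi_{O_i \setminus I_k}(m_i(c_{i,m}))) \\
&= \mathrm{FLIP}_{Y,Z}(\Pi_{O_i \setminus I_k}(d_{i,m}))
\end{aligned}$$

This shows (6).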

(ii) If $j = k$, $m_k$ gets all its inputs from $m_i$, so $\Pi_{I_k}(d_{i,m}) = c_{k,m}$. Hence

$$\begin{aligned}
c_{k,\hat{m}} &= \mathrm{EFLIP}_{Y,Z;M(\Pi_{I_k}(d_{i,m}))}(c_{k,m}) \\
&= \mathrm{EFLIP}_{Y,Z;M(c_{k,m})}(c_{k,m}) \\
&= \mathrm{EFLIP}_{Y,Z;d_{k,m}}(c_{k,m})
\end{aligned}$$

Therefore,

$$d_{k,\hat{m}} = m_k(c_{k,\hat{m}}) = m_k(\mathrm{EFLIP}_{Y,Z;d_{k,m}}(c_{k,m}))$$

Let us evaluate the term $m_k(\mathrm{EFLIP}_{Y,Z;d_{k,m}}(c_{k,m}))$. Here the input to $m_k$ is $c_{k,m}$ and its output is
$d_{k,m}$. (a) If $d_{k,m} = w_y$, then

$$\mathrm{EFLIP}_{Y,Z;d_{k,m}}(c_{k,m}) = \Pi_{I_k}(z),$$

and in turn

$$d_{k,\hat{m}} = m_k(\mathrm{EFLIP}_{Y,Z;d_{k,m}}(c_{k,m})) = w_z;$$

(b) if $d_{k,m} = w_z$, then

$$\mathrm{EFLIP}_{Y,Z;d_{k,m}}(c_{k,m}) = \Pi_{I_k}(y),$$

and in turn

$$d_{k,\hat{m}} = m_k(\mathrm{EFLIP}_{Y,Z;d_{k,m}}(c_{k,m})) = w_y;$$

(c) otherwise

$$d_{k,\hat{m}} = m_k(\mathrm{EFLIP}_{Y,Z;d_{k,m}}(c_{k,m})) = m_k(c_{k,m}) = d_{k,m}.$$

According to Definition 10 the above implies that

$$d_{k,\hat{m}} = \mathrm{FLIP}_{Y,Z}(d_{k,m})$$

This shows (4).

) If 7 7^ / and mj is a private module, mj can get inputs from m,-. (but since there is no data sharing 
IjCiIk = 0)> and other private or public modules m(,i / / (i can be equal to k). Let us partition the 
input to ifij (Cj^m and ^ defined on Ij) on attributes Ij n O,- and Ij \ Oi From Q, using the induction 
hypothesis, 

n/,\o, (C;,™) = FLIPY,z(n/^.\0, (C;,m)) (7) 

Now 4 n/y = 0, since there is no data sharing. Hence {Ij fl Oi) C {Oi\Ik)- From ^ using Observa- 
tion El 
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n/^no,(C;,m) =FLlPY,z(n/^no,(Cj,m)) (8) 
From (O and dSjl, using Observation [T]©, and since Cj^m,*^j,fh are defined on /y, so 

Cy,;5i = FLIPY,z(C;>) (9) 

From 

= my(FLIPY,z(C;Vn)) 
— d jm 

= FLIPY,z(dy,,„) 

FLlPY,z(d;>)) = dy,m follows due to tiie fact tiiat Oj n (O,- U Ok) = 0' 7^ Y,Z are defined on 

Oi U O^:, wiiereas dy_,„ is defined on Oy (again see Observation [T]©). 

(iv) Finally consider the case where $m_j$ is a public module such that $j \ne k$. $m_j$ can still get input from $m_i$, but none of the
attributes in $I_j \cap O_i$ can be hidden by the definition of $m_k = M = C(h_i)$. Further, by the definition of
$M = m_k$, $m_j$ cannot get any input from $m_k$ ($M$ is the closure of public modules); so $I_j \cap O_k = \emptyset$. Let us
partition the inputs to $m_j$ ($c_{j,m}$ and $c_{j,\hat{m}}$, defined on $I_j$) into two disjoint subsets: (a) $I_j \cap O_i$, and
(b) $I_j \setminus O_i$. Since there is no data sharing, $I_k \cap I_j = \emptyset$, and we again get that

Cj^,Ti = FLIPY,z(Cy,m) 
— Cy ,)j 

FLiPy ^(cy m) = Cy ,„ follows duc to foUowing reason. If Ij n Oi = 0, i.e. if mj does not get any 
input from m,-, again this is true (then Ij n (O; U Ok) = {Ij H O,) U (/y PI Ok) = 0). If mj gets an input 
from Mi, i.e. /y n O,- 7^ 0, since j 7^ /y n O,- does not include any hidden attributes from hj (nik is 
the closure C(/j,)).But y =/,. z, i.e. the visible attribute values of y,z are the same. In other words, 
n/^no, (y) = n/^no, (z), and again from Observation [T](l5]l, 

FLIPY,z(Cy,m) = FLIPy,z(cy,m) = Cy,,„ 

(again, Ij nOk = 0). 
Therefore, 

dy,m = ^^j{^j,m) 

= nijlcjffi) 

= my(FLIPY,z(Cy,m)) 
= my(cy,m) 
= dy,,„ 

= FLIPY.z(dy,m) 

$\mathrm{FLIP}_{Y,Z}(d_{j,m}) = d_{j,m}$ again follows due to the fact that $O_j \cap (O_i \cup O_k) = \emptyset$, since $j \notin \{i,k\}$.

Hence all the cases for the induction hypothesis hold true, and this completes the proof of the lemma for 
Case-II. □

A.3 Proof of Lemma 2

Recall that $I_i, O_i, A_i$ denote the set of input, output and all attributes of a module $m_i$.

Lemma 2. Let $M$ be a composite module consisting only of public modules. Let $H$ be a subset of hidden
attributes such that every public module $m_j$ in $M$ is UD-safe with respect to $A_j \cap H$. Then $M$ is UD-safe
with respect to $(I \cup O) \cap H$.

Proof. Let us assume, wlog., that the modules in $M$ are $m_1, \dots, m_p$ where modules are listed in topological
order. For $j = 1$ to $p$, let $M^j$ be the composite module comprising $m_1, \dots, m_j$, and let $I^j, O^j$ be its input and
output. Hence $M^p = M$, $I^p = I$ and $O^p = O$. We prove by induction on $2 \le j \le p$ that $M^j$ is UD-safe with
respect to $H \cap (I^j \cup O^j)$. We present the proof without going through the notations for the sake of simplicity.

The base case directly follows for $j = 1$, since $A_1 = I_1 \cup O_1 = I^1 \cup O^1$. Let the hypothesis hold until $M^j$
and consider $M^{j+1}$. By induction hypothesis, $M^j$ is UD-safe with respect to $(I^j \cup O^j) \cap H$. The module
$m_{j+1}$ may consume some outputs of $M^j$ ($m_2$ to $m_j$). Hence

$$I^{j+1} = I^j \cup (I_{j+1} \setminus O^j) \quad\text{and}\quad O^{j+1} = (O^j \setminus I_{j+1}) \cup O_{j+1} \qquad (10)$$

Consider two equivalent inputs $x_1, x_2$ with respect to hidden attributes $H \cap (I^{j+1} \cup O^{j+1})$. Therefore
their projections on the visible input attributes $I^{j+1} \setminus H$ are the same (A).

Partition $I^{j+1}$ into $I^j$ and $I^{j+1} \setminus I^j$. The projections of $x_1$ and $x_2$ on $I^j \setminus H$, say $x_{11}, x_{12}$, will be the
same. Therefore, the inputs to $M^j$ are equivalent. By hypothesis, their outputs, say $z_1, z_2$, will have the same
values on $O^j \setminus H$ (B).

Again, on inputs $x_1, x_2$ to $M^{j+1}$, the inputs to $m_{j+1}$ will be the concatenation of (i) the projection of the outputs $z_1, z_2$
from $M^j$ on $O^j \cap I_{j+1}$ and (ii) the projection of $x_1, x_2$ on $I_{j+1} \setminus O^j$. From (A) and (B), they will be equivalent on
the visible attributes. Therefore, the inputs to $m_{j+1}$ are equivalent with respect to $H$. Since
$m_{j+1}$ is UD-safe, the outputs, say $w_1, w_2$, are also equivalent (C).

Now note that the output $y_1$ of $M^{j+1}$ on $x_1$ is defined on $O^{j+1} = (O^j \setminus I_{j+1}) \cup O_{j+1}$. Its projection on $O^j \setminus I_{j+1}$ is the projection of
$z_1$ on $O^j \setminus I_{j+1}$, and its projection on $O_{j+1}$ is $w_1$. Similarly $y_2$ can be partitioned. From (B) and (C), the
projections are equivalent, therefore the outputs $y_1$ and $y_2$ are equivalent.

This shows that for two equivalent inputs the outputs are equivalent. The other direction, that for two equivalent
outputs all of their inputs are equivalent, can be proved in a similar way by considering modules in reverse
topological order from $m_p$ to $m_2$. □

B Proofs from Section 5

B.1 Proof of Theorem 3

Theorem 3. Given public module $m_j$ with $k$ attributes, and a subset of hidden attributes $H$, deciding
whether $m_j$ is UD-safe with respect to $H$ is coNP-hard in $k$. Further, all UD-safe subsets can be found
in EXP-time in $k$.






Proof of coNP-hardness

Proof. We do a reduction from UNSAT, where given $n$ variables $x_1, \dots, x_n$ and a boolean formula $f(x_1, \dots, x_n)$,
the goal is to check whether $f$ is not satisfiable. In our construction, $m_i$ has $n + 1$ inputs $x_1, \dots, x_n$ and $y$,
and the output is $z = m_i(x_1, \dots, x_n, y) = f(x_1, \dots, x_n) \lor y$ (OR). The hidden attributes are $H = \{x_1, \dots, x_n\}$, so $y, z$ are
visible. We claim that $f$ is not satisfiable if and only if $m_i$ is UD-safe with respect to $H$.

Suppose $f$ is not satisfiable, so for all assignments of $x_1, \dots, x_n$, $f(x_1, \dots, x_n) = 0$. For output $z = 0$,
the visible attribute $y$ must have value 0 in all the rows of the relation of $m_i$. Also for $z = 1$, the visible
attribute $y$ must have value 1, since in all rows $f(x_1, \dots, x_n) = 0$. Hence for equivalent inputs with respect
to $H$, the outputs are equivalent and vice versa. Therefore $m_i$ is UD-safe with respect to $H$.

Now suppose $f$ is satisfiable. Then there is at least one assignment of $x_1, \dots, x_n$ such that $f(x_1, \dots, x_n) =
1$. In this row, for $y = 0$, $z = 1$. However, for all assignments of $x_1, \dots, x_n$, whenever $y = 1$, $z = 1$. So for
output $z = 1$, the inputs producing $z$ are not all equivalent with respect to the visible attribute $y$; therefore $m_i$ is
not upstream-safe and hence not UD-safe. □
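
The reduction can be checked by brute force on tiny formulas. The following is a minimal sketch, not part of the original proof: the helper names `is_ud_safe` and `reduction` are hypothetical, attributes are assumed Boolean, and the single output is assumed visible, as in the construction above.

```python
from itertools import product

def is_ud_safe(module, n_inputs, hidden_inputs):
    """Brute-force UD-safety check for a Boolean module with one visible output.

    module: function taking a tuple of n_inputs bits and returning one bit.
    hidden_inputs: set of hidden input positions (assumption: output is visible).
    """
    visible = [i for i in range(n_inputs) if i not in hidden_inputs]
    outputs_by_visible_input = {}
    for x in product((0, 1), repeat=n_inputs):
        x_vis = tuple(x[i] for i in visible)
        outputs_by_visible_input.setdefault(x_vis, set()).add(module(x))
    # Downstream-safe: inputs agreeing on visible attributes give one visible output.
    if any(len(outs) > 1 for outs in outputs_by_visible_input.values()):
        return False
    # Upstream-safe: distinct visible inputs never share a visible output.
    seen = [next(iter(outs)) for outs in outputs_by_visible_input.values()]
    return len(seen) == len(set(seen))

def reduction(f, n):
    """Module m(x_1..x_n, y) = f(x_1..x_n) OR y, with x_1..x_n hidden."""
    module = lambda bits: int(f(bits[:n]) or bits[n])
    return module, n + 1, set(range(n))

unsat = lambda x: x[0] and not x[0]      # never true
sat = lambda x: x[0] or x[1]             # satisfiable
tautology = lambda x: x[0] or not x[0]   # always true

print(is_ud_safe(*reduction(unsat, 1)))
print(is_ud_safe(*reduction(sat, 2)))
```

Note that for a satisfiable formula that is not a tautology the brute-force check already fails at the downstream-safety step; for a tautology it fails at the upstream-safety step, which is the case the proof spells out.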

Upper Bound to Find All UD-safe Solutions. The lower bounds studied for the second step of the four
step optimization show that for a public module $m_j$, it is not possible to have poly-time algorithms (in $|A_j|$)
even to decide if a given subset $H \subseteq A_j$ is UD-safe, unless P = NP. Here we present Algorithm 1 that
finds all UD-safe solutions of $m_j$ in time exponential in $k_j = |A_j|$, assuming that the maximum domain
size of attributes $\Delta$ is a constant.

Time complexity. The outer for loop runs for all possible subsets of $A_j$, i.e. $2^{k_j}$ times. The inner for
loop runs for at most $\Delta^{|I_j \setminus H|}$ times (this is the maximum number of such tuples $x^+$), whereas the check
if $H$ is a valid downstream-safe subset takes $O(\Delta^{|I_j \cap H|})$ time. Here we ignore the time complexity to check
equality of tuples, which takes only time polynomial in $|A_j|$ and will be dominated by the exponential
terms. For the upstream-safety check, the number of $(x^+, y^+)$ pairs is at most $\Delta^{|I_j \setminus H|}$, and computing the
number of distinct $x^+, y^+$ tuples from the pairs can be done in $O(\Delta^{2|I_j \setminus H|})$ time by a naive search; the time
complexity can be improved by the use of a hash function. Hence the total time complexity is dominated by
$2^{k_j} \times O(\Delta^{|I_j \setminus H|}) \times O(\Delta^{|I_j \cap H|} + \Delta^{2|I_j \setminus H|}) = O(2^{k_j} \Delta^{3k_j}) = 2^{O(k_j)}$. By doing a tighter analysis, the multiplicative
factor in the exponents can be improved; however, we make the point here that the algorithm runs in time
exponential in $k_j = |A_j|$.

Correctness. The following lemma proves the correctness of Algorithm 1.

Lemma 4. Algorithm 1 adds $H \subseteq A_j$ to $\mathcal{U}_j$ if and only if $m_j$ is UD-safe with respect to $H$.

Proof. (if) Suppose $H$ is a UD-safe subset for $m_j$. Then $H$ is downstream-safe, i.e. for equivalent inputs
with respect to the visible attributes $I_j \setminus H$, the projection of the output on the visible attributes $O_j \setminus H$ will
be the same, so $H$ will pass the downstream-safety test.

Since $H$ is UD-safe, $H$ is also upstream-safe. Clearly, by definition, $n_1 \ge n_2$. Suppose $n_1 > n_2$. Then
there are two tuples $x_1^+$ and $x_2^+$ that pair with the same $y^+$. By construction, $x_1^+$ and $x_2^+$ (and all input tuples $x$ to $m_j$
that project on these two tuples) have different values on the visible input attributes $I_j \setminus H$, but they map to
outputs $y$-s that have the same value on visible output attributes $O_j \setminus H$. Then $H$ is not upstream-safe, which
is a contradiction. Hence $n_1 = n_2$, and $H$ will also pass the test for upstream-safety and be included in $\mathcal{U}_j$.

(only if) Suppose $H$ is not UD-safe. Then it is either not upstream-safe or not downstream-safe. Suppose
it is not downstream-safe. Then for at least one assignment $x^+$, the values of $y$ generated by the
assignments $x^-$ will not be equivalent with respect to the visible output attributes, and the downstream-safety test
will fail.

Algorithm 1 Algorithm to find all UD-safe solutions $\mathcal{U}_j$ for a public module $m_j$

    - Set U_j = ∅.
    for every subset H of A_j do
        /* Check if H is downstream-safe */
        for every assignment x+ of the visible input attributes in I_j \ H do
            - Check if for every assignment x- of the hidden input attributes in I_j ∩ H,
              the value of Π_{O_j \ H}(m_j(x)) is the same,
              where Π_{I_j \ H}(x) = x+ and Π_{I_j ∩ H}(x) = x-
            if not then
                - H is not downstream-safe.
                - Go to the next H.
            else
                - H is downstream-safe.
                - Let y+ = Π_{O_j \ H}(m_j(x)) = projection of all such tuples that have
                  projection = x+ on the visible input attributes
                - Label this set of input-output pairs (x, m_j(x)) by (x+, y+).
            end if
        end for
        /* Check if H is upstream-safe */
        - Consider the pairs (x+, y+) constructed above.
        - Let n1 be the number of distinct x+ values, and let n2 be the number of distinct y+ values.
        if n1 == n2 then
            - H is upstream-safe.
            - Add H to U_j.
        else
            - H is not upstream-safe.
            - Go to the next H.
        end if
    end for
    return The set of subsets U_j.



Suppose $H$ is downstream-safe but not upstream-safe. Then there are two tuples $x_1^+$ and $x_2^+$ that
pair with the same $y^+$. This makes $n_1 > n_2$, and the upstream-safety test will fail. □
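
Algorithm 1 translates almost line by line into a brute-force enumeration. The sketch below is illustrative only: the function name `all_ud_safe_subsets`, the representation of a module as a Python function on tuples, and the assumption that all attributes share one finite domain are choices made here, not taken from the paper.

```python
from itertools import combinations, product

def all_ud_safe_subsets(inputs, outputs, domain, module):
    """Enumerate every UD-safe hidden subset H of A_j = inputs + outputs.

    inputs, outputs: lists of attribute names (assumed disjoint).
    domain: list of values every attribute can take (assumed shared).
    module: function mapping an input tuple to an output tuple.
    """
    attrs = list(inputs) + list(outputs)
    rows = [(x, module(x)) for x in product(domain, repeat=len(inputs))]
    safe = []
    for r in range(len(attrs) + 1):
        for hidden in map(frozenset, combinations(attrs, r)):
            vis_in = [i for i, a in enumerate(inputs) if a not in hidden]
            vis_out = [i for i, a in enumerate(outputs) if a not in hidden]
            pairs = {}  # maps each x+ to its unique y+
            downstream_safe = True
            for x, y in rows:
                x_plus = tuple(x[i] for i in vis_in)
                y_plus = tuple(y[i] for i in vis_out)
                # Downstream-safety: one x+ must not see two different y+.
                if pairs.setdefault(x_plus, y_plus) != y_plus:
                    downstream_safe = False
                    break
            if not downstream_safe:
                continue
            n1 = len(pairs)                # distinct x+ values
            n2 = len(set(pairs.values()))  # distinct y+ values
            if n1 == n2:                   # upstream-safety test
                safe.append(hidden)
    return safe

# Usage: a copy module b = a and a constant module b = 0 over the domain {0, 1}.
copy_safe = all_ud_safe_subsets(["a"], ["b"], [0, 1], lambda x: (x[0],))
const_safe = all_ud_safe_subsets(["a"], ["b"], [0, 1], lambda x: (0,))
```

A dictionary keyed by `x+` plays the role of the hash function mentioned in the time-complexity discussion, so the distinct-value counts come for free.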

B.2 Correctness of Optimal Algorithm for Chain Workflows 

Recall that after renumbering the modules, $m_1, \dots, m_k$ denote the modules in the public closure $C$ of a
private module $m_i$. The following lemma shows that $Q[j,\ell]$ correctly stores the desired value: the cost of the
minimum cost hidden subset $H^{j\ell}$ that satisfies the UD-safe condition for all public modules $m_1$ to $m_j$, and
$A_j \cap H^{j\ell} = U_{j\ell} \in \mathcal{U}_j$. Recall that we use the simplified notations $S$ for the safe subset $S_{i\ell}$ of $m_i$, $C$ for the public
closure $C(S_{i\ell})$, and $H$ for $H_i$.

Lemma 5. For $1 \le j \le k$, the entry $Q[j,\ell]$, $1 \le \ell \le p_j$, stores the minimum cost of the hidden attributes $H^{j\ell}$
such that $\bigcup_{x=1}^{j} A_x \supseteq H^{j\ell} \supseteq S$, $A_j \cap H^{j\ell} = U_{j\ell}$, and every module $m_x$, $1 \le x \le j$, in the chain is UD-safe with
respect to $A_x \cap H^{j\ell}$.

The following proposition will be useful to prove the lemma.

Proposition 3. For a public module $m_j$, for two UD-safe hidden subsets $U_1, U_2 \subseteq A_j$, if $U_1 \cap O_j = U_2 \cap O_j$,
then $U_1 = U_2$.

The proof of the proposition is simple, and therefore is omitted.
Proof of Lemma 5

Proof. We prove this by induction from $j = 1$ to $k$. The base case follows by the definition of $Q[1,\ell]$, for
$1 \le \ell \le p_1$. Here the requirements are $A_1 \supseteq H^{1\ell} \supseteq S$, and $H^{1\ell} = U_{1\ell}$. So we set the cost at $Q[1,\ell]$ to
$c(U_{1\ell}) = c(H^{1\ell})$, if $U_{1\ell} \supseteq S$.

Suppose the hypothesis holds until $j - 1$, and consider $j$. Let $H^{j\ell}$ be the minimum solution s.t. $A_j \cap H^{j\ell} =
U_{j\ell}$ and which satisfies the other conditions of the lemma.

First consider the case when there is no $q$ such that $U_{j-1,q} \cap O_{j-1} = U_{j\ell} \cap I_j$, where we set the cost to
be $\infty$. If there is no such $q$, i.e. for all $q \le p_{j-1}$ we have $U_{j-1,q} \cap O_{j-1} \ne U_{j\ell} \cap I_j$, then clearly there cannot be any solution $H^{j\ell}$ that contains
$U_{j\ell}$ and also guarantees UD-safe properties of all $x < j$ (in particular for $x = j - 1$). In that case the cost
of the solution is indeed $\infty$.

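Otherwise (when such a $q$ exists), let us divide the cost of the solution $c(H^{j\ell})$ into two disjoint parts:

$$c(H^{j\ell}) = c(H^{j\ell} \cap O_j) + c(H^{j\ell} \setminus O_j)$$

We argue that $c(O_j \cap H^{j\ell}) = c(O_j \cap U_{j\ell})$. $A_j \cap H^{j\ell} = U_{j\ell}$. Then $O_j \cap U_{j\ell} = O_j \cap A_j \cap H^{j\ell} = O_j \cap H^{j\ell}$
since $O_j \subseteq A_j$. Hence $c(O_j \cap H^{j\ell}) = c(O_j \cap U_{j\ell})$. This accounts for the cost of the first part of $Q[j,\ell]$.

Next we argue that $c(H^{j\ell} \setminus O_j)$ is the minimum cost $Q[j-1,q]$, $1 \le q \le p_{j-1}$, where the minimum is over
those $q$ where $U_{j-1,q} \cap O_{j-1} = U_{j\ell} \cap I_j$. Due to the chain structure of $C$, $O_j \cap \bigcup_{x=1}^{j-1} A_x = \emptyset$, and
$O_j \cup \bigcup_{x=1}^{j-1} A_x = \bigcup_{x=1}^{j} A_x$. Since $\bigcup_{x=1}^{j} A_x \supseteq H^{j\ell}$, $H^{j\ell} \setminus O_j = H^{j\ell} \cap \bigcup_{x=1}^{j-1} A_x$.

Consider $H' = H^{j\ell} \cap \bigcup_{x=1}^{j-1} A_x$. By definition of $H^{j\ell}$, $H'$ must satisfy the UD-safe requirement of all
$1 \le x \le j - 1$. Further, $\bigcup_{x=1}^{j-1} A_x \supseteq H'$. $A_j \cap H^{j\ell} = U_{j\ell}$, hence $U_{j\ell} \subseteq H^{j\ell}$.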

We are considering the case where there is a $q$ such that

$$U_{j-1,q} \cap O_{j-1} = U_{j\ell} \cap I_j \qquad (11)$$

Therefore

$$U_{j-1,q} \cap O_{j-1} \subseteq U_{j\ell} \subseteq H^{j\ell}$$

We claim that if $q$ satisfies (11), then $A_{j-1} \cap H' = U_{j-1,q}$. Therefore, by induction hypothesis, $Q[j-1,q]$
stores the minimum cost solution $H'$ that includes $U_{j-1,q}$, and the part $c(H^{j\ell} \setminus O_j)$ of the optimal solution cost
for $m_j$ is the minimum value of such $Q[j-1,q]$.

So it remains to show that $A_{j-1} \cap H' = U_{j-1,q}$. $A_{j-1} \cap H' = A_{j-1} \cap H^{j\ell} \in \mathcal{U}_{j-1}$, since $H^{j\ell}$ gives a UD-safe
solution for $m_{j-1}$. Suppose $A_{j-1} \cap H^{j\ell} = U_{j-1,y}$. Then we argue that $U_{j-1,q} = U_{j-1,y}$, which will complete
the proof.

$U_{j-1,y} \cap O_{j-1} = (A_{j-1} \cap H^{j\ell}) \cap O_{j-1} = H^{j\ell} \cap O_{j-1} = H^{j\ell} \cap I_j = (A_j \cap H^{j\ell}) \cap I_j$, i.e.

$$U_{j-1,y} \cap O_{j-1} = U_{j\ell} \cap I_j \qquad (12)$$

From (11) and (12),

$$U_{j-1,q} \cap O_{j-1} = U_{j-1,y} \cap O_{j-1}$$

Since both $U_{j-1,q}, U_{j-1,y} \in \mathcal{U}_{j-1}$, from Proposition 3, $U_{j-1,q} = U_{j-1,y}$. This completes the proof of the
lemma. □
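
The recurrence established by Lemma 5 can be sketched as a short table-filling routine. This is a hedged illustration, not the authors' implementation: the function name `chain_dp`, the dictionary layout of each module (`"I"`, `"O"`, and the list `"U"` of its UD-safe subsets), and the additive cost map are assumptions made for the example.

```python
import math

def chain_dp(modules, S, cost):
    """Fill the table Q for a chain of public modules m_1..m_k (0-indexed here).

    modules: list of dicts with keys 'I', 'O' (sets of attributes) and
             'U' (list of UD-safe hidden subsets of that module).
    S: safe subset chosen for the private module.
    cost: dict mapping each attribute to its hiding cost.
    """
    Q = []
    for j, m in enumerate(modules):
        row = []
        for U in m["U"]:
            if j == 0:
                # Base case: Q[1, l] = c(U_1l) if U_1l contains S, else infinity.
                row.append(sum(cost[a] for a in U) if S <= U else math.inf)
                continue
            prev = modules[j - 1]
            # Cheapest compatible entry: U_{j-1,q} ∩ O_{j-1} = U_{jl} ∩ I_j.
            best = min(
                (Q[j - 1][q] for q, Uq in enumerate(prev["U"])
                 if Uq & prev["O"] == U & m["I"]),
                default=math.inf,
            )
            # Add the cost of the newly hidden outputs, c(O_j ∩ U_jl).
            row.append(best + sum(cost[a] for a in U & m["O"]))
        Q.append(row)
    return Q

# Usage on a two-module chain a -> m1 -> b -> m2 -> c.
chain = [
    {"I": {"a"}, "O": {"b"}, "U": [{"a", "b"}]},
    {"I": {"b"}, "O": {"c"}, "U": [{"b", "c"}, set()]},
]
table = chain_dp(chain, {"a"}, {"a": 1, "b": 2, "c": 3})
```

Reading off the final answer as the minimum over the last row of the table is an assumption of this sketch; the passage above only proves what each entry stores.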
B.3 Optimal Algorithm for Tree Workflows

Here we prove Theorem 4 for tree workflows.

Optimal algorithm for tree workflows. Similar to the algorithm for chain workflows, to obtain an
algorithm of time polynomial in $L$ for tree workflows, for a given module $m_i$, we can go over all choices of safe
subsets $S_{i\ell} \in \mathcal{S}_i$ of $m_i$, compute the public-closure $C(S_{i\ell})$, and choose a minimal cost subset $H_i = H_i(S_{i\ell})$
that satisfies the UD-safe properties of all modules in the public-closure. Then, output, among them, a
subset having the minimum cost. Consequently, it suffices to explain how, given a safe subset $S_{i\ell} \in \mathcal{S}_i$, one
can solve, in PTIME, the problem of finding a minimum cost hidden subset $H_i$ that satisfies the UD-safe
property of all modules in a subgraph formed by a given $C(S_{i\ell})$.

To simplify notations, the given safe subset $S_{i\ell}$ will be denoted below by $S$, the closure $C(S_{i\ell})$ will be
denoted by $C$, and the output hidden subset $H_i$ will be denoted by $H$. Our PTIME algorithm uses dynamic
programming to find the optimal $H$.

First note that since $C$ is the public-closure of (some) output attributes for a tree workflow, $C$ is a
collection of trees all of which are rooted at the private module $m_i$. Let us consider the tree $T$ rooted at $m_i$ with
subtrees in $C$ (note that $m_i$ can have private children that are not considered in $T$). Let $k$ be the number of
modules in $T$, and let the modules in $T$ be renumbered as $m_i, m_1, \dots, m_k$, where the private module $m_i$ is the
root, and the rest are public modules.

Now we solve the problem by dynamic programming as follows. Let $Q$ be a $k \times L$ two-dimensional
array, where $Q[j,\ell]$, $1 \le j \le k$, $1 \le \ell \le p_j$, denotes the cost of the minimum cost hidden subset $H^{j\ell}$ that (i)
satisfies the UD-safe condition for all public modules in the subtree of $T$ rooted at $m_j$, which we denote
by $T_j$, and (ii) $H^{j\ell} \cap A_j = U_{j\ell}$ (recall that $I_j, O_j, A_j$ are the sets of input, output and all attributes of $m_j$
respectively); the actual solution can be stored easily by a standard argument. The algorithm is described
below.

• Initialization for leaf nodes. The initialization step handles all leaf nodes $m_j$ in $T$. For a leaf node
$m_j$, $1 \le \ell \le p_j$,

$$Q[j,\ell] = c(U_{j\ell})$$

• Internal nodes. The internal nodes are considered in a bottom-up fashion (by a post-order traversal),
and $Q[j,\ell]$ is computed for a node $m_j$ after its children are processed.

For an internal node $m_j$, let $m_{i_1}, \dots, m_{i_x}$ be its children in $T$. Then for $1 \le \ell \le p_j$,

1. Consider UD-safe subset $U_{j\ell}$.

2. For $y = 1$ to $x$, let $U^y = U_{j\ell} \cap I_{i_y}$. Since there is no data sharing, the $U^y$-s are disjoint.

3. For $y = 1$ to $x$,

$$k^y = \arg\min_k Q[i_y,k], \text{ where the minimum is over } 1 \le k \le p_{i_y} \text{ s.t. } U_{i_y,k} \cap I_{i_y} = U^y,$$

and $k^y = \bot$ (undefined) if there is no such $k$.

4. $Q[j,\ell]$ is computed as follows.

$$Q[j,\ell] = \begin{cases} \infty & \text{if } \exists y,\ 1 \le y \le x,\ k^y = \bot \\ c(I_j \cap U_{j\ell}) + \sum_{y=1}^{x} Q[i_y,k^y] & \text{otherwise} \end{cases}$$

• Final solution for $S$. Now consider the private module $m_i$ that is the root of $T$. Recall that we fixed
a safe solution $S$ of $m_i$ for doing the analysis. Let $m_{i_1}, \dots, m_{i_x}$ be the children of $m_i$ in $T$ (which are
public modules). Similar to the step before, we consider the min-cost solutions of its children which
exactly match the hidden subset $S$ of $m_i$.

1. Consider safe subset $S$ of $m_i$.

2. For $y = 1$ to $x$, let $S^y = S \cap I_{i_y}$. Since there is no data sharing, again, the $S^y$-s are disjoint.

3. For $y = 1$ to $x$,

$$k^y = \arg\min_k Q[i_y,k], \text{ where the minimum is over } 1 \le k \le p_{i_y} \text{ s.t. } U_{i_y,k} \cap I_{i_y} = S^y,$$

and $k^y = \bot$ (undefined) if there is no such $k$.

4. The cost of the optimal $H$ (let us denote that by $c^*$) is computed as follows.

$$c^* = \begin{cases} \infty & \text{if } \exists y,\ 1 \le y \le x,\ k^y = \bot \\ \sum_{y=1}^{x} Q[i_y,k^y] & \text{otherwise} \end{cases}$$

It is not hard to see that the trivial solution of UD-safe subsets that include all attributes of the modules
gives a finite cost solution by the above algorithm.
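
The three steps above (leaf initialization, internal nodes in post-order, final solution at the root) can be sketched with a memoized recursion. The names `tree_dp`, `fill` and `best_child` and the dictionary layout of the tree are hypothetical choices for this illustration; only the recurrences come from the algorithm described above.

```python
import math

def tree_dp(tree, root_children, S, cost):
    """Compute c* and the table Q for a tree of public modules.

    tree: dict name -> {'I': set of input attributes,
                        'U': list of UD-safe hidden subsets,
                        'children': list of child module names}.
    root_children: public children of the private root module.
    S: safe subset fixed for the private root.
    cost: dict mapping each attribute to its hiding cost.
    """
    Q = {}

    def best_child(child, required):
        # Cheapest entry of the child's row whose hidden inputs equal `required`;
        # math.inf stands for the undefined case (no matching k).
        fill(child)
        node = tree[child]
        return min((Q[child][k] for k, U in enumerate(node["U"])
                    if U & node["I"] == required), default=math.inf)

    def fill(j):
        if j in Q:
            return
        node = tree[j]
        row = []
        for U in node["U"]:
            if not node["children"]:
                row.append(sum(cost[a] for a in U))         # leaf: c(U_jl)
            else:
                total = sum(cost[a] for a in U & node["I"])  # c(I_j ∩ U_jl)
                for child in node["children"]:
                    total += best_child(child, U & tree[child]["I"])
                row.append(total)
        Q[j] = row

    c_star = sum(best_child(ch, S & tree[ch]["I"]) for ch in root_children)
    return c_star, Q

# Usage: private root with output a, then m1 (a -> b), then leaf m2 (b -> c).
example_tree = {
    "m1": {"I": {"a"}, "U": [{"a", "b"}], "children": ["m2"]},
    "m2": {"I": {"b"}, "U": [{"b", "c"}, {"b"}], "children": []},
}
c_star, Q = tree_dp(example_tree, ["m1"], {"a"}, {"a": 1, "b": 1, "c": 5})
```

Recursion with memoization visits children before parents, which matches the post-order traversal in the description.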

Lemma 6, stated and proved below, shows that $Q[j,\ell]$ correctly stores the desired value. Given this
lemma, the correctness of the algorithm easily follows. For hidden subset $H \supseteq S$ in the closure, for every
public child $m_{i_y}$ of $m_i$, $H \cap I_{i_y} \supseteq S \cap I_{i_y} = S^y$. Further, each such $m_{i_y}$ has to be UD-safe with respect to $H \cap A_{i_y}$.
In other words, for each $m_{i_y}$, $H \cap A_{i_y}$ must equal $U_{i_y,k^y}$ for some $1 \le k^y \le p_{i_y}$. The last step in our algorithm
(that computes $c^*$) tries to find such a $k^y$ that has the minimum cost $Q[i_y,k^y]$, and the total cost $c^*$ of $H$ is
$\sum_{m_{i_y}} Q[i_y,k^y]$, where the sum is over all children of $m_i$ in the tree $T$ (the trees rooted at $m_{i_y}$ are disjoint, so the
optimal cost $c^*$ is the sum of those costs). This proves Theorem 4 for tree workflows.

$Q[j,\ell]$ stores correct values. The following lemma shows that the algorithm stores correct values in
$Q[j,\ell]$ for all public modules $m_j$ in the closure $C$.

Lemma 6. For $1 \le j \le k$, let $T_j$ be the subtree rooted at $m_j$ and let $\mathrm{Att}_j = \bigcup_{m_q \in T_j} A_q$. The entry $Q[j,\ell]$,
$1 \le \ell \le p_j$, stores the minimum cost of the hidden attributes $H^{j\ell} \subseteq \mathrm{Att}_j$ such that $A_j \cap H^{j\ell} = U_{j\ell}$, and every
module $m_q \in T_j$ is UD-safe with respect to $A_q \cap H^{j\ell}$.

To complete the proof of Theorem 4 for tree workflows, we need to prove Lemma 6, which we do next.

Proof. We prove the lemma by an induction on all nodes at depth $h = H$ down to 1 of the tree $T$, where
depth $H$ contains all leaf nodes and depth 1 contains the children of the root $m_i$ (which is at depth 0).

First consider any leaf node $m_j$ at height $H$. Then $T_j$ contains only $m_j$ and $\mathrm{Att}_j = A_j$. For any $1 \le
\ell \le p_j$, since $\mathrm{Att}_j = A_j \supseteq H^{j\ell}$ and $A_j \cap H^{j\ell} = U_{j\ell}$, we have $H^{j\ell} = U_{j\ell}$. In this case $H^{j\ell}$ is unique and $Q[j,\ell]$ correctly stores
$c(U_{j\ell}) = c(H^{j\ell})$.

Suppose the induction holds for all nodes up to height $h + 1$, and consider a node $m_j$ at height $h$. Let
$m_{i_1}, \dots, m_{i_x}$ be the children of $m_j$, which are at height $h + 1$. Let $H^{j\ell}$ be the min-cost solution, which is
partitioned into two disjoint components:



Figure 7: Reduction from 3SAT. White modules are public, grey are private. Red thin edges denote TRUE
assignment, blue bold edges denote FALSE assignment.



$$c(H^{j\ell}) = c(H^{j\ell} \cap I_j) + c(H^{j\ell} \setminus I_j)$$

First we argue that $c(H^{j\ell} \cap I_j) = c(I_j \cap U_{j\ell})$. $A_j \cap H^{j\ell} = U_{j\ell}$. Then $I_j \cap U_{j\ell} = I_j \cap A_j \cap H^{j\ell} = I_j \cap H^{j\ell}$, since
$I_j \subseteq A_j$. Hence $c(I_j \cap H^{j\ell}) = c(I_j \cap U_{j\ell})$. This accounts for the cost of the first part of $Q[j,\ell]$.

Next we analyze the cost $c(H^{j\ell} \setminus I_j)$. This cost comes from the subtrees $T_{i_1}, \dots, T_{i_x}$, which are disjoint
due to the tree structure and absence of data sharing. Let us partition the subset $H^{j\ell} \setminus I_j$ into disjoint parts
$(H^{j\ell} \setminus I_j) \cap \mathrm{Att}_{i_y}$, $1 \le y \le x$. Below we prove that $c((H^{j\ell} \setminus I_j) \cap \mathrm{Att}_{i_y}) = Q[i_y,k^y]$, $1 \le y \le x$, where $k^y$ is
computed as in the algorithm. This will complete the proof of the lemma.

To see this, let $H' = (H^{j\ell} \setminus I_j) \cap \mathrm{Att}_{i_y}$. Clearly, $\mathrm{Att}_{i_y} \supseteq H'$. Every $m_q \in T_j$ is UD-safe with respect
to $A_q \cap H^{j\ell}$. If also $m_q \in T_{i_y}$, then $A_q \cap H' = A_q \cap H^{j\ell}$, and therefore all $m_q \in T_{i_y}$ are also UD-safe with
respect to $H'$. In particular, $m_{i_y}$ is UD-safe with respect to $H'$, and therefore $A_{i_y} \cap H' = U_{i_y,k^y}$ for some $k^y$,
since $U_{i_y,k^y}$ was chosen as the UD-safe set by our algorithm.

Finally we argue that $c(H') = c(H^{i_y,k^y})$, where $H^{i_y,k^y}$ is the min-cost solution for $m_{i_y}$ among all such
subsets. This follows from our induction hypothesis, since $m_{i_y}$ is a node at depth $h + 1$. Therefore, $c(H') =
c(H^{i_y,k^y}) = Q[i_y,k^y]$, i.e.

$$c((H^{j\ell} \setminus I_j) \cap \mathrm{Att}_{i_y}) = Q[i_y,k^y]$$

as desired. This proves the lemma. □

B.4 Proof of NP-hardness for DAG Workflows 

Here we prove NP-hardness for arbitrary DAG workflows as stated in Theorem 4 by a reduction from 3SAT.

Given a CNF formula $\psi$ on $n$ variables $z_1, \dots, z_n$ and $m$ clauses $\psi_1, \dots, \psi_m$, we construct a graph as
shown in Figure 7. Let variable $z_i$ occur in $m_i$ different clauses (as positive or negative literals). In the
figure, the module $p_0$ is the single private module ($m_i$), having a single output attribute $a$. The rest of the
modules are the public modules in the public-closure $C(\{a\})$.

For every variable $z_i$, we create $m_i + 2$ nodes: $p_i$, $y_i$ and $x_{i,1}, \dots, x_{i,m_i}$. For every clause $\psi_j$, we create 2
modules $C_j$ and $f_j$.

The edge connections are as follows:

1. $p_0$ sends its single output $a$ to $p_1$.

2. For every $i = 1$ to $n - 1$, $p_i$ has two outputs; one is sent to $p_{i+1}$ and the other is sent to $y_i$. $p_n$ sends its
single output to $y_n$.

3. Each $y_i$, $i = 1$ to $n$, sends two outputs to $x_{i,1}$. The blue outgoing edge from $y_i$ denotes positive
assignment of the variable $z_i$, whereas the red edge denotes negative assignment of the variable $z_i$.

4. Each $x_{i,j}$, $i = 1$ to $n$, $j = 1$ to $m_i - 1$, sends two outputs (blue and red) to $x_{i,j+1}$. In addition, $x_{i,j}$,
$i = 1$ to $n$, $j = 1$ to $m_i$, sends a blue (resp. red) edge to clause node $C_q$ if the variable $z_i$ is positive
(resp. negative) in the clause $C_q$ (and $C_q$ is the $j$-th such clause containing $z_i$).

5. Each $C_j$, $j = 1$ to $m$, sends its single output to $f_j$.

6. Each $f_j$, $j = 1$ to $m - 1$, sends its single output to $f_{j+1}$. $f_m$ outputs the single final output.
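
The edge rules above can be made concrete with a short construction routine. This is an illustrative sketch only: the function name `build_reduction_edges`, the string node labels, the signed-integer clause encoding, and the `"plain"` tag for uncoloured edges are assumptions of the example; it also assumes every variable occurs in at least one clause and omits the final output of $f_m$, which has no receiving module.

```python
def build_reduction_edges(n, clauses):
    """Build the edge list (source, target, colour) of the reduction graph.

    n: number of variables z_1..z_n.
    clauses: list of lists of signed ints, e.g. [1, -2, 3] encodes
             (z1 OR NOT z2 OR z3).
    """
    edges = [("p0", "p1", "a")]                      # rule 1
    for i in range(1, n + 1):
        if i < n:
            edges.append((f"p{i}", f"p{i+1}", "plain"))  # rule 2, first output
        edges.append((f"p{i}", f"y{i}", "plain"))        # rule 2, output to y_i
        # Occurrences of z_i: (clause index, colour of the literal).
        occurrences = [(j, "blue" if lit > 0 else "red")
                       for j, clause in enumerate(clauses, 1)
                       for lit in clause if abs(lit) == i]
        prev = f"y{i}"
        for pos, (j, colour) in enumerate(occurrences, 1):
            node = f"x{i},{pos}"
            edges.append((prev, node, "blue"))           # rules 3 and 4
            edges.append((prev, node, "red"))
            edges.append((node, f"C{j}", colour))        # rule 4, clause edge
            prev = node
    m = len(clauses)
    for j in range(1, m + 1):
        edges.append((f"C{j}", f"f{j}", "plain"))        # rule 5
        if j < m:
            edges.append((f"f{j}", f"f{j+1}", "plain"))  # rule 6
    return edges

# Usage: one clause (z1 OR NOT z2 OR z3) over three variables.
edges = build_reduction_edges(3, [[1, -2, 3]])
```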
The UD-safe sets are defined as follows:

1. For every $i = 1$ to $n - 1$, $p_i$ has a single UD-safe set: hide all its inputs and outputs.

2. Each $y_i$, $i = 1$ to $n$, has three UD-safe choices: (1) hide its unique input and blue output, (2) hide its
unique input and red output, (3) hide its single input and both blue and red outputs.

3. Each $x_{i,j}$, $i = 1$ to $n$, $j = 1$ to $m_i$, has three choices: (1) hide blue input and all blue outputs, (2) hide
red input and all red outputs, (3) hide all inputs and all outputs.

4. Each $C_j$, $j = 1$ to $m$, has choices: hide the single output and at least one of the three inputs.

5. Each $f_j$, $j = 1$ to $m$, has the single choice: hide all its inputs and outputs.

Cost. The outputs from $y_i$, $i = 1$ to $n$, have unit cost; the cost of the other attributes is 0. The following
lemma proves correctness of the construction.

Lemma 7. There is a solution of the single-module problem of cost $n$ if and only if the 3SAT formula $\psi$ is
satisfiable.

Proof. (if) Suppose the 3SAT formula is satisfiable, so there is an assignment of the variables $z_i$ that makes
$\psi$ true. If $z_i$ is set to True (resp. False), choose the blue (resp. red) outgoing edge from $y_i$. Then choose
the other edges accordingly: (1) choose the outgoing edge from $p_0$; (2) choose all inputs and outputs of $p_i$, $i = 1$
to $n$; (3) if the blue (resp. red) input of $x_{i,j}$ is chosen, all its blue (resp. red) outputs are chosen; and (4) all
inputs and outputs of $f_j$ are chosen. Clearly, all these are UD-safe sets by construction.

So we only have to argue about the clause nodes $C_j$. Since $\psi$ is satisfied by the given assignment, there
is a literal $z_i \in C_j$ (positive or negative) whose assignment makes it true. Hence at least one of the inputs to
$C_j$ will be chosen. So the UD-safe requirements of all the clause nodes are satisfied. The total cost
of the solution is $n$ since exactly one output of each of the $y_i$ nodes, $i = 1$ to $n$, has been chosen.

(only if) Suppose there is a solution to the single-module problem of cost $n$. Then each $y_i$ can choose
exactly one output (at least one output has to be chosen to satisfy the UD-safe property for each $y_i$, and more than
one output cannot be chosen as the cost is $n$). If $y_i$ chooses the blue (resp. red) output, this forces the $x_{i,j}$ nodes
to select the corresponding blue (resp. red) inputs and outputs. No $x_{i,j}$ can choose the UD-safe option of
selecting all its inputs and outputs, as in that case $y_i$ would finally be forced to select both outputs, which would exceed
the cost. Since $C_j$ satisfies the UD-safe condition, this in turn forces each $C_j$ to select the corresponding blue
(resp. red) inputs.

If the blue (resp. red) output of $y_i$ is chosen, the variable is set to True (resp. False). By the above
argument, at least one such red or blue input will be chosen as input to each $C_j$, which satisfies the corresponding
clause $\psi_j$. □

C General Workflows 

In this section we discuss the privacy theorem for general workflows as outlined in Section 5. First we define
directed-path and downward-closure as follows (similar to public path and public-closure).

Definition 12. A module $m_1$ has a directed path to another module $m_2$ if there are modules $m_{i_1}, m_{i_2}, \dots, m_{i_j}$
such that $m_{i_1} = m_1$, $m_{i_j} = m_2$, and for all $1 \le k < j$, $O_{i_k} \cap I_{i_{k+1}} \ne \emptyset$.

There is a directed path from an attribute $a \in A$ to module $m_j$ if there is a module $m_k$ such that $a \in I_k$ and
$m_k$ has a directed path to $m_j$.

Definition 13. Given a private module $m_i$ and a set of hidden output attributes $h_i \subseteq O_i$ of $m_i$, the downward-
closure of $m_i$ with respect to $h_i$, denoted by $D(h_i)$, is the set of modules $m_j$ (both private and public) such
that there is a directed path from some attribute $a \in h_i$ to $m_j$.

Also recall downstream-safety (D-safety) defined in Definition 8, which says that for equivalent inputs
to a module with respect to hidden attributes, the outputs must be equivalent. We prove the following
theorem in this section:



Theorem 5. (Privacy Theorem for General Workflows) Let $W$ be any workflow. For each private module
$m_i$ in $W$, let $H_i$ be a subset of hidden attributes such that (i) $h_i = H_i \cap O_i$ is safe for $\Gamma$-standalone-privacy
of $m_i$, (ii) each private and public module $m_j$ in the downward-closure $D(h_i)$ is D-safe with respect
to $A_j \cap H_i$, and (iii) $H_i \subseteq O_i \cup \bigcup_{j: m_j \in D(h_i)} A_j$. Then the workflow $W$ is $\Gamma$-private with respect to
$H = \bigcup_{i: m_i \text{ is private}} H_i$.

In the proof of Theorem 2 from Lemma 1, we used the fact that for single-predecessor workflows, for
two distinct private modules $m_i, m_k$, the public-closures and the hidden subsets $H_i, H_k$ are disjoint. However,
it is not hard to see that this is not the case for general workflows, where the downward-closures and the
subsets $H_i$ may overlap. Further, the D-safe property is not monotone (hiding more output attributes will
maintain the D-safe property, but hiding more input attributes may destroy the D-safe property). So we
need to argue that the D-safe property is maintained when we take the union of the $H_i$ sets in the workflow, which
is formalized by the following lemma.

Lemma 8. If a module $m_j$ is D-safe with respect to sets $H_1, H_2 \subseteq A_j$, then $m_j$ is D-safe with respect
to $H = H_1 \cup H_2$.






Given two equivalent inputs $x_1 \equiv_H x_2$ with respect to $H = H_1 \cup H_2$, we have to show that their outputs are
equivalent: $m_j(x_1) \equiv_H m_j(x_2)$. Even if $x_1, x_2$ are equivalent with respect to $H$, they may not be equivalent
with respect to $H_1$ or $H_2$. In the proof we construct a new tuple $x_3$ such that $x_1 \equiv_{H_1} x_3$ and $x_2 \equiv_{H_2} x_3$. Then,
using the D-safe properties of $H_1$ and $H_2$, we show that $m_j(x_1) \equiv_H m_j(x_3) \equiv_H m_j(x_2)$. The formal proof
is given below.

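Proof. Let $H = H_1 \cup H_2$. Let $x_1$ and $x_2$ be two input tuples to $m_j$ such that $x_1 \equiv_H x_2$, i.e.

$$\Pi_{I_j \setminus H}(x_1) = \Pi_{I_j \setminus H}(x_2) \qquad (13)$$

For $a \in I_j$, let $x_3[a]$ denote the value of the $a$-th attribute of $x_3$ (similarly $x_1[a]$, $x_2[a]$). From (13), for $a \in I_j \setminus H$,
$x_1[a] = x_2[a]$. Let us define a tuple $x_3$ as follows on four disjoint subsets of $I_j$:

$$x_3[a] = \begin{cases} x_1[a] & \text{if } a \in I_j \cap H_1 \cap H_2 \\ x_1[a] & \text{if } a \in I_j \cap (H_2 \setminus H_1) \\ x_2[a] & \text{if } a \in I_j \cap (H_1 \setminus H_2) \\ x_1[a] = x_2[a] & \text{if } a \in I_j \setminus H \end{cases}$$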

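For instance, on attribute set $I_j = (a_1, \ldots, a_5)$, let $x_1 = (\underline{2}, \underline{3}, \underline{2}, 6, 7)$, $x_2 = (\underline{4}, \underline{5}, \underline{9}, 6, 7)$, $H_1 = \{a_1, a_2\}$
and $H_2 = \{a_2, a_3\}$, so that $H = \{a_1, a_2, a_3\}$ (in $x_1, x_2$, the hidden attribute values in $H$ are underlined). Then
$x_3 = (4, 3, 2, 6, 7)$.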
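The construction of $x_3$ and the two equivalences it is designed to satisfy can be checked directly on this example. The sketch below is illustrative only; the function names `build_x3` and `equivalent` are not from the paper, and tuples are represented as dictionaries keyed by attribute name.

```python
def build_x3(attrs, x1, x2, h1, h2):
    """Build x3 from x1 and x2 following the four-case definition in the proof."""
    x3 = {}
    for a in attrs:
        if a in h1 and a not in h2:
            x3[a] = x2[a]  # a in H1 \ H2: copy from x2
        else:
            # a in H1 ∩ H2, a in H2 \ H1, or a visible (where x1[a] == x2[a])
            x3[a] = x1[a]
    return x3

def equivalent(attrs, u, v, hidden):
    """u and v are equivalent w.r.t. `hidden` if they agree on every visible attribute."""
    return all(u[a] == v[a] for a in attrs if a not in hidden)

ATTRS = ["a1", "a2", "a3", "a4", "a5"]
X1 = dict(zip(ATTRS, (2, 3, 2, 6, 7)))
X2 = dict(zip(ATTRS, (4, 5, 9, 6, 7)))
H1 = {"a1", "a2"}
H2 = {"a2", "a3"}

X3 = build_x3(ATTRS, X1, X2, H1, H2)
print(tuple(X3[a] for a in ATTRS))   # (4, 3, 2, 6, 7), as in the example
print(equivalent(ATTRS, X1, X3, H1)) # True: x1 and x3 agree outside H1
print(equivalent(ATTRS, X2, X3, H2)) # True: x2 and x3 agree outside H2
```

Note that $x_1$ and $x_2$ themselves are not equivalent with respect to $H_1$ alone (they differ on $a_3$), which is why the intermediate tuple is needed.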

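(1) First we claim that $x_1 \equiv_{H_1} x_3$, or,

$$\Pi_{I_j \setminus H_1}(x_1) = \Pi_{I_j \setminus H_1}(x_3) \qquad (14)$$

Partition $I_j \setminus H_1$ into two disjoint subsets, $I_j \cap (H_2 \setminus H_1)$ and $I_j \setminus (H_1 \cup H_2) = I_j \setminus H$. From the definition of
$x_3$, for all $a \in I_j \cap (H_2 \setminus H_1)$ and all $a \in I_j \setminus H$, $x_1[a] = x_3[a]$. This shows (14).

(2) Next we claim that $x_2 \equiv_{H_2} x_3$, or,

$$\Pi_{I_j \setminus H_2}(x_2) = \Pi_{I_j \setminus H_2}(x_3) \qquad (15)$$

Again partition $I_j \setminus H_2$ into two disjoint subsets, $I_j \cap (H_1 \setminus H_2)$ and $I_j \setminus (H_1 \cup H_2) = I_j \setminus H$. From the definition
of $x_3$, for all $a \in I_j \cap (H_1 \setminus H_2)$ and all $a \in I_j \setminus H$, $x_2[a] = x_3[a]$. This shows (15). (14) and (15) can also
be verified from the above example.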

(3) Now by the condition stated in the lemma, $m_j$ is D-safe with respect to $H_1$ and $H_2$. Therefore,
using (14) and (15), $m_j(x_1) \equiv_{H_1} m_j(x_3)$ and $m_j(x_2) \equiv_{H_2} m_j(x_3)$, or,

$$\Pi_{O_j \setminus H_1}(m_j(x_1)) = \Pi_{O_j \setminus H_1}(m_j(x_3)) \qquad (16)$$

and

$$\Pi_{O_j \setminus H_2}(m_j(x_2)) = \Pi_{O_j \setminus H_2}(m_j(x_3)) \qquad (17)$$

Since $O_j \setminus H = O_j \setminus (H_1 \cup H_2) \subseteq O_j \setminus H_1$, from (16),

$$\Pi_{O_j \setminus H}(m_j(x_1)) = \Pi_{O_j \setminus H}(m_j(x_3)) \qquad (18)$$

Similarly, since $O_j \setminus H \subseteq O_j \setminus H_2$, from (17),

$$\Pi_{O_j \setminus H}(m_j(x_2)) = \Pi_{O_j \setminus H}(m_j(x_3)) \qquad (19)$$

From (18) and (19),

$$\Pi_{O_j \setminus H}(m_j(x_1)) = \Pi_{O_j \setminus H}(m_j(x_2)) \qquad (20)$$

In other words, the output tuples $m_j(x_1), m_j(x_2)$, which are defined on attribute set $O_j$, satisfy

$$m_j(x_1) \equiv_H m_j(x_2) \qquad (21)$$

Since we started with two arbitrary input tuples $x_1 \equiv_H x_2$, this shows that for all equivalent input tuples the
outputs are also equivalent. In other words, $m_j$ is D-safe with respect to $H = H_1 \cup H_2$. □
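For small finite domains, D-safety can be checked by brute force, which gives a quick sanity check of Lemma 8 and of the non-monotonicity remark above. The sketch below is a hedged illustration: the module `toy_module`, its attribute names, and the Boolean domain are all hypothetical choices, and the checker simply enumerates every input tuple.

```python
from itertools import product

def is_d_safe(func, inputs, outputs, hidden, domain=(0, 1)):
    """Brute-force D-safety check: inputs that agree on the visible input
    attributes must yield outputs that agree on the visible output attributes."""
    vis_in = [k for k, a in enumerate(inputs) if a not in hidden]
    vis_out = [k for k, a in enumerate(outputs) if a not in hidden]
    seen = {}  # visible input projection -> visible output projection
    for x in product(domain, repeat=len(inputs)):
        key = tuple(x[k] for k in vis_in)
        y = func(x)
        val = tuple(y[k] for k in vis_out)
        if seen.setdefault(key, val) != val:
            return False
    return True

INPUTS = ("a1", "a2", "a3")
OUTPUTS = ("b1", "b2")

def toy_module(x):
    """Hypothetical module: b1 = a1 XOR a2, b2 = a3."""
    a1, a2, a3 = x
    return (a1 ^ a2, a3)

SET_1 = {"a1", "b1"}
SET_2 = {"a2", "b1"}

print(is_d_safe(toy_module, INPUTS, OUTPUTS, SET_1))          # True
print(is_d_safe(toy_module, INPUTS, OUTPUTS, SET_2))          # True
print(is_d_safe(toy_module, INPUTS, OUTPUTS, SET_1 | SET_2))  # True, as Lemma 8 predicts
print(is_d_safe(toy_module, INPUTS, OUTPUTS, set()))          # True: hiding nothing is safe
print(is_d_safe(toy_module, INPUTS, OUTPUTS, {"a1"}))         # False: hiding an input alone breaks safety
```

The last two calls show the non-monotonicity: the empty set is safe, but adding the single input attribute `a1` is not, because the visible output `b1` then varies while the visible inputs stay fixed.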

Along with this lemma, two other simple observations will be useful. 

Observation 3.

1. Any module $m_j$ is D-safe with respect to $\emptyset$ (hiding nothing maintains the downstream-safety property).

2. If $m_j$ is D-safe with respect to $H$, and if $H'$ is such that $H \subseteq H'$ but $I_j \setminus H' = I_j \setminus H$, then $m_j$ is also
D-safe with respect to $H'$ (hiding more output attributes maintains the downstream-safety property).

C.1 Main Lemma for Privacy Theorem for General Workflows

The following lemma is the crucial component in the proof of Theorem 5, and is analogous to Lemma 1 for
single-predecessor workflows.

Lemma 9. Consider a standalone private module $m_i$, a set of hidden attributes $h_i$, any input $x$ to $m_i$, and
any candidate output $y \in \text{OUT}_{x,m_i,h_i}$ of $x$. Then $y \in \text{OUT}_{x,W,H_i}$ when $m_i$ belongs to an arbitrary (general)
workflow $W$, and a set of attributes $H_i \subseteq A$ is hidden such that (i) $h_i \subseteq H_i$, (ii) only output attributes from $O_i$
are included in $h_i$ (i.e. $h_i \subseteq O_i$), and (iii) every module $m_j$ in the downward-closure $D(h_i)$ is D-safe with
respect to $A_j \cap H_i$.

Proof. We fix a module $m_i$, an input $x$ to $m_i$, a set of safe hidden attributes $h_i$, and a candidate output
$y \in \text{OUT}_{x,m_i,h_i}$ for $x$. For simplicity, let us refer to the set of modules in $D(h_i)$ by $D$. We will show that
$y \in \text{OUT}_{x,W,H_i}$ where the hidden attributes $H_i$ satisfy the conditions in the lemma. In the proof, we show
the existence of a possible world $R' \in \text{Worlds}(R, H_i)$ such that if $\Pi_{I_i}(t) = x$ for some $t \in R'$, then $\Pi_{O_i}(t) = y$.
Since $y \in \text{OUT}_{x,m_i,h_i}$, by Lemma 3, $y \equiv_{h_i} z$ where $z = m_i(x)$.

We will use the FLIP function used in the proof of Lemma 1 (see Appendix A.2). We redefine the
module $m_i$ to $\widehat{m}_i$ as follows. For an input $u$ to $m_i$, $\widehat{m}_i(u) = \text{FLIP}_{y,z}(m_i(u))$. All other public and private
modules are unchanged, $\widehat{m}_j = m_j$. The required possible world $R'$ is obtained by taking the join of the
standalone relations of these $\widehat{m}_j$-s, $j \in [n]$.

First note that by the definition of $\widehat{m}_i$, $\widehat{m}_i(x) = y$ (since $\widehat{m}_i(x) = \text{FLIP}_{y,z}(m_i(x)) = \text{FLIP}_{y,z}(z) = y$, from
Observation 1). Hence if $\Pi_{I_i}(t) = x$ for some $t \in R'$, then $\Pi_{O_i}(t) = y$.

Next we argue that $R' \in \text{Worlds}(R, H_i)$. Since $R'$ is the join of the standalone relations for the modules $\widehat{m}_j$,
$R'$ maintains all functional dependencies $I_j \to O_j$. Also, none of the public modules are changed, hence
for any public module $m_j$ and any tuple $t$ in $R'$, $\Pi_{O_j}(t) = m_j(\Pi_{I_j}(t))$. So we only need to show that the
projections of $R$ and $R'$ on the visible attributes are the same.

Let us assume, without loss of generality, that the modules are numbered in topologically sorted order. Let $I_0$ be the initial
input attributes to the workflow, and let $p$ be a tuple defined on $I_0$. There are two unique tuples $t \in R$ and
$t' \in R'$ such that $\Pi_{I_0}(t) = \Pi_{I_0}(t') = p$. Note that any intermediate or final attribute $a \in A \setminus I_0$ belongs to $O_j$
for a unique $j \in [1, n]$ (since for $j \neq \ell$, $O_j \cap O_\ell = \emptyset$). So it suffices to show that $t, t'$ projected on $O_j$ are
equivalent with respect to visible attributes for all modules $m_j$, $j = 1$ to $n$.

Let $c_{j,m}, c_{j,\widehat{m}}$ be the values of input attributes $I_j$ and $d_{j,m}, d_{j,\widehat{m}}$ be the values of output attributes $O_j$ of
module $m_j$, in $t \in R$ and $t' \in R'$ respectively on initial input attributes $p$ (i.e. $c_{j,m} = \Pi_{I_j}(t)$, $c_{j,\widehat{m}} = \Pi_{I_j}(t')$,
$d_{j,m} = \Pi_{O_j}(t)$ and $d_{j,\widehat{m}} = \Pi_{O_j}(t')$). We prove by induction on $j = 1$ to $n$ that

$$d_{j,\widehat{m}} \equiv_{H_i} d_{j,m} \quad \text{if } j = i \text{ or } m_j \in D \qquad (22)$$

$$d_{j,\widehat{m}} = d_{j,m} \quad \text{otherwise} \qquad (23)$$

If the above is true for all $j$, then $\Pi_{O_j}(t) \equiv_{H_i} \Pi_{O_j}(t')$ for all $j$; along with the fact that the initial inputs $p$ are the
same, this implies that $t \equiv_{H_i} t'$.

Proof of (22) and (23). The base case follows for $j = 1$. If $m_1 \neq m_i$ ($m_1$ can be public or private),
then $I_1 \cap O_i = \emptyset$, so for all inputs $u$, $\widehat{m}_1(u) = m_1(\text{FLIP}_{y,z}(u)) = m_1(u)$. Since the inputs $c_{1,\widehat{m}} = c_{1,m}$ (both
projections of the initial input $p$ on $I_1$), the outputs $d_{1,\widehat{m}} = d_{1,m}$. This shows (23). If $m_1 = m_i$, the inputs are the
same, and by definition of $\widehat{m}_1$, $d_{1,\widehat{m}} = \widehat{m}_1(c_{1,\widehat{m}}) = \text{FLIP}_{y,z}(m_i(c_{1,\widehat{m}})) = \text{FLIP}_{y,z}(m_i(c_{1,m})) = \text{FLIP}_{y,z}(d_{1,m})$.
Since $y, z$ only differ in the hidden attributes, by the definition of the FLIP function $d_{1,\widehat{m}} \equiv_{H_i} d_{1,m}$. This
shows (22). Note that the module $m_1$ cannot belong to $D$, since then it would have $m_i$ as a predecessor and could not
be the first module in topological order.

Suppose the hypothesis holds until $j - 1$, and consider $m_j$. There are three cases to consider.

(i) If $j = i$, for all predecessors $m_k$ of $m_i$ ($O_k \cap I_i \neq \emptyset$), $k \neq i$ and $m_k \notin D$, since the workflow is a
DAG. Therefore from (23), using the induction hypothesis, $c_{i,\widehat{m}} = c_{i,m}$. Hence $d_{i,\widehat{m}} = \widehat{m}_i(c_{i,\widehat{m}}) =
\text{FLIP}_{y,z}(m_i(c_{i,\widehat{m}})) = \text{FLIP}_{y,z}(m_i(c_{i,m})) = \text{FLIP}_{y,z}(d_{i,m})$. Again, $y, z$ are equivalent with respect to $H_i$,
so $d_{i,\widehat{m}} \equiv_{H_i} d_{i,m}$. This shows (22) in the inductive step.

(ii) If $j \neq i$ ($\widehat{m}_j = m_j$) and $m_j \notin D$, then $m_j$ does not get any of its inputs from any module in $D$, or
any hidden attributes from $m_i$ (otherwise, by the definition of $D$, $m_j \in D$). Using the IH, from (23) and from
(22), using the fact that $y, z$ are equivalent on visible attributes, $c_{j,\widehat{m}} = c_{j,m}$. Then $d_{j,\widehat{m}} = m_j(c_{j,\widehat{m}}) =
m_j(c_{j,m}) = d_{j,m}$. This shows (23) in the inductive step.

(iii) If $j \neq i$ but $m_j \in D$, $m_j$ can get its inputs from $m_i$, from other modules in $D$, or from
modules not in $D$. Using the IH from (22) and (23), $c_{j,\widehat{m}} \equiv_{H_i} c_{j,m}$. Since $m_j \in D$, by the condition of
the lemma, $m_j$ is D-safe with respect to $A_j \cap H_i$. Therefore the corresponding outputs $d_{j,\widehat{m}} = m_j(c_{j,\widehat{m}})$
and $d_{j,m} = m_j(c_{j,m})$ are equivalent, or $d_{j,\widehat{m}} \equiv_{H_i} d_{j,m}$. This again shows (22) in the inductive step.

Hence the IH holds for all $j = 1$ to $n$, and this completes the proof of the lemma. □
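The possible-world construction in this proof can be traced on a two-module toy workflow. The sketch below rests on several assumptions that are not taken from the paper: FLIP is assumed to swap the values of $y$ and $z$ attribute by attribute (the actual definition is in Appendix A.2 and may differ in detail), and the modules `m1`, `m2`, the attribute names, and the Boolean domain are hypothetical.

```python
def flip(w, y, z):
    """Assumed FLIP_{y,z}: wherever w carries y's value, substitute z's value,
    and vice versa; all other attribute values are left unchanged."""
    out = dict(w)
    for a in y:
        if a not in w:
            continue
        if w[a] == y[a]:
            out[a] = z[a]
        elif w[a] == z[a]:
            out[a] = y[a]
    return out

def m1(a1):
    """Hypothetical private module: a1 -> (a2, a3)."""
    return {"a2": a1, "a3": 1 - a1}

def m2(a2):
    """Hypothetical public module downstream of the hidden attribute a2."""
    return {"a4": a2}

def run(private, a1):
    """One workflow execution: the private module followed by the public one."""
    out1 = private(a1)
    return {"a1": a1, **out1, **m2(out1["a2"])}

def visible(relation, hidden):
    """Projection of a relation on the visible attributes, as a sorted list of tuples."""
    return sorted(tuple(v for k, v in sorted(t.items()) if k not in hidden)
                  for t in relation)

HIDDEN = {"a2", "a4"}     # H_1: h_1 = {a2}, plus a4 so that m2 stays D-safe
Z = m1(0)                 # actual output on input x = 0
Y = {"a2": 1, "a3": 1}    # candidate output; differs from Z only on hidden a2

def m1_hat(a1):
    """Redefined private module: FLIP applied to the original output."""
    return flip(m1(a1), Y, Z)

R = [run(m1, a) for a in (0, 1)]
R_PRIME = [run(m1_hat, a) for a in (0, 1)]

print(m1_hat(0) == Y)                                  # True: x = 0 now maps to y
print(visible(R, HIDDEN) == visible(R_PRIME, HIDDEN))  # True: same visible view
```

Here the public module `m2` is unchanged in both worlds; the two relations differ only on the hidden attributes `a2` and `a4`, so an observer of the visible view cannot rule out $y$ as the output of the private module on input 0.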

C.2 Proof of Theorem 5

Finally, we prove Theorem 5 using Lemmas 8 and 9.

Proof of Theorem 5. We argue that if $H_i$ satisfies the conditions in Theorem 5, then $H'_i = \bigcup_{\ell: m_\ell \text{ is private}} H_\ell$ satisfies
the conditions in Lemma 9. The first two conditions are easily satisfied by $H'_i$: (i) $h_i \subseteq H_i \subseteq H'_i$ and (ii)
$h_i \subseteq O_i$. So we need to show (iii), i.e. all modules in the downward-closure $D(h_i)$ are D-safe with respect
to $A_j \cap H'_i$.

From the conditions in the theorem, each module $m_j \in D(h_i)$ is D-safe with respect to $A_j \cap H_i$. We
show that for any other private module $m_k \neq m_i$, $m_j$ is also D-safe with respect to $A_j \cap H_k$. There are
three cases, as discussed below.

Case-I: If $m_j \in D(h_k)$, by the D-safety conditions in the theorem, $m_j$ is D-safe with respect
to $A_j \cap H_k$.

Case-II: If $m_j \notin D(h_k)$ and $m_j \neq m_k$, for any private module $m_k \neq m_i$, $A_j \cap H_k = \emptyset$ (since $H_k \subseteq
O_k \cup \bigcup_{m_\ell \in D(h_k)} A_\ell$ from the theorem). From Observation 3, $m_j$ is D-safe with respect to $A_j \cap H_k$.

Case-III: If $m_j \notin D(h_k)$ but $m_j = m_k$ (or $j = k$), then $H_k \cap A_j \subseteq O_j$ (again since $H_k \subseteq O_k \cup \bigcup_{m_\ell \in D(h_k)} A_\ell$
and $O_k = O_j$). From Observation 3, $m_j$ is D-safe with respect to $\emptyset$, and $A_j \cap H_k \supseteq \emptyset$. Further, $I_j \setminus \emptyset = I_j =
I_j \setminus (A_j \cap H_k)$. This is because $H_k \cap A_j \subseteq O_j$ and, since $I_j \cap O_j = \emptyset$, $I_j \cap (A_j \cap H_k) = \emptyset$. Hence from the same
observation, $m_j$ is D-safe with respect to $A_j \cap H_k$.

Hence $m_j$ is D-safe with respect to $A_j \cap H_i$ and, for all private modules $m_k \neq m_i$, $m_j$ is D-safe with
respect to $A_j \cap H_k$. By Lemma 8, $m_j$ is then D-safe with respect to $(A_j \cap H_i) \cup (A_j \cap H_k) = A_j \cap (H_i \cup H_k)$.
By a simple induction on all private modules $m_k$, $m_j$ is D-safe with respect to $A_j \cap (\bigcup_{k: m_k \text{ is private}} H_k) =
A_j \cap H'_i$. Hence $H'_i$ satisfies the conditions stated in the lemma. The rest of the proof follows by the same
argument as in the proof of Theorem 2. □